Abstract

Direct policy search (DPS) and look-ahead tree (LT) policies are two widely used classes of techniques to 
produce high performance policies for sequential decision-making problems. To make DPS approaches work 
well, one crucial issue is to select an appropriate space of parameterized policies with respect to the targeted 
problem. A fundamental issue in LT approaches is that, to take good decisions, such policies must develop 
very large look-ahead trees which may require excessive online computational resources. In this paper, we 
propose a new hybrid policy learning scheme that lies at the intersection of DPS and LT, in which the policy 
is an algorithm that develops a small look-ahead tree in a directed way, guided by a node scoring function that 
is learned through DPS. The LT-based representation is shown to be a versatile way of representing policies 
in a DPS scheme, while at the same time, DPS makes it possible to significantly reduce the size of the look-ahead trees 
that are required to take high-quality decisions. 

We experimentally compare our method with two other state-of-the-art DPS techniques and four common 
LT policies on four benchmark domains and show that it combines the advantages of the two techniques from 
which it originates. In particular, we show that our method: (1) produces overall better performing policies 
than both pure DPS and pure LT policies, (2) requires a substantially smaller number of policy evaluations 
than other DPS techniques, (3) is easy to tune and (4) results in policies that are quite robust with respect to 
perturbations of the initial conditions. 



Keywords: Reinforcement learning, direct policy search, look-ahead tree search 
1 Introduction



A wide range of techniques has been developed in the last decades for solving optimal sequential decision-making 
problems in the fields of optimal control and reinforcement learning. Two important classes of techniques 
that are known to work well on difficult problems characterized by large state spaces are direct policy search 
(DPS) and look-ahead tree (LT) policies.

In DPS, a policy is seen as a parameterized function that maps states to actions. In order to identify the best 
settings of the parameters, DPS techniques rely on local or global optimization algorithms that directly maximize 
the performance of the policy estimated through simulation or real-world experiments. DPS avoids learning value 
functions and all the complexities involved with approximating them. Instead, optimization is carried out directly 
in policy space. The rationale for doing this is that, while usually the value function will be a rather complex 
object and thus thought to be hard to approximate, a control policy will often be a much "simpler" object and 
thus could also be (so it is believed) much easier to learn. Note that in DPS approaches, making a decision is 
generally a very fast operation while all heavy computations are performed in an offline way, when optimizing 
the parameters of the policy. 

Look-ahead tree policies rely on a totally different principle than DPS, since they have no parameters that 
need to be learned in an offline way: all the computational work is done online during each decision-making 
step. In order to take each single decision, an LT policy proceeds in two steps: first, it develops a look-ahead tree 
by simulating the evolution of the system subject to multiple possible future action sequences, then, it selects 
an action based on the information collected in this tree. The quality of decisions taken by LT policies depends 
on the size of the tree: the more action sequences have been tried, the better the final decision will be. The 
computational complexity of an LT policy can thus be adjusted by changing the size of the look-ahead tree, small 
trees leading to fast approximate decisions and large trees leading to expensive high-quality decisions. 

Both DPS and LT techniques have their strengths and weaknesses. DPS can often produce, with compara- 
tively little effort, good policies for difficult control problems where typical value function based methods would 
otherwise fail. However, a major issue in DPS is that the final performance strongly depends on the choice of an 
appropriate policy representation. Common policy representations include linear parametrizations [22], neural 
networks [37, 47, 19] or radial basis functions [6, 14] and typically have hyper-parameters that require tuning 
(e.g. the number of hidden neurons). For a given problem, choosing an appropriate representation and tuning 
its hyper-parameters is a difficult problem, which typically involves a large amount of trial and error. On the 
other hand, this effort is not required with LT policies thanks to the look-ahead tree that acts as a unified built-in 
highly generic representation of the decision problem. Hence, LT policies can be applied to almost any sequential 
decision-making problem without any prior knowledge, and regardless of the dimensionality of 
the state space. The main issue with LT policies is that they typically require a large amount of online computational 
resources to produce high-quality decisions. In practice, these resources are often limited, either due to 
real-time constraints (e.g. 100 ms per decision) or human-scale constraints (e.g. one hour or one day per deci- 
sion). Given such a constraint and given the difficulty of the problem, the actions suggested by LT policies may 
be arbitrarily sub-optimal. 

LT policies alleviate the need to choose a complex approximation structure for policies, while at the same 
time DPS offers a principled way of reducing online computational requirements through an offline learning 
procedure. Starting from these two observations, it is natural to wonder how we can combine these two ideas in 
order to create a new hybrid solution for sequential decision-making that leverages the advantages of both. 

In this paper, we propose to bridge the gap between LT and DPS policies with a new hybrid technique: 
optimized look-ahead tree (OLT) policies.¹ From the point of view of LT policies, our approach starts from 
the observation that the efficiency of an LT policy, given a finite computational budget, can be improved by 
directing the way in which the tree is grown, for example by giving priority to sequences of actions that seem 
more promising [23]. In fact, a large number of different strategies could be considered to develop look-ahead 
trees, by combining aspects from breadth-first exploration to depth-first exploration and from random exploration 
to purely greedy exploration. In practice, the quality of these different strategies depends on the characteristics 
of the problem (e.g. to what extent are rewards informative about the long-term goal?), on the available online 

1 This technique was initially proposed in [32] and is here extended through a more mature exposition of the method and an extensive 
experimental study which covers many more aspects than the initial paper. Furthermore, we introduce the use of Gaussian process 
optimization to improve the sample efficiency of the approach. 



computational budget, and may even also depend on the current location in the state-space. Optimized look-ahead 
trees then consist in parameterizing a node scoring function that encodes the way in which look-ahead trees are 
grown and then learning the best values of these parameters in a problem-driven way through DPS. 

Traditionally, policies in DPS rely on an approximation structure, which is usually one of those used in 
supervised learning. These approximation structures lead to policies which are representable through simple 
mathematical formulas, such as the maximization of a linear dot-product or the computation of a feed-forward 
neural network. Note, however, that choosing such a policy or an OLT policy does not make any conceptual 
difference for DPS: as any other policy, OLT policies may be viewed as functions that receive a state, perform 
some parameter-dependent internal computations and then return an action. The distinguishing feature of OLT 
policies is that their internal computations rely on explicitly exploiting a generative model of the problem. 

We experimentally compare our hybrid method with two other state-of-the-art DPS techniques, where the 
policies are represented through neural networks and adaptive radial basis functions, respectively. We also con- 
sider LT policies with four different exploration strategies. Based on an extensive study on four challenging 
benchmark domains, we show that our approach indeed leverages the advantages of the two techniques from 
which it originates. In particular, we show first that OLT policies often perform significantly better than both 
pure DPS and pure LT policies and that they are at least competitive in all cases. Second, thanks to their use of a 
model, it turns out that OLT policies require a substantially smaller number of parameters than neural networks 
and radial basis functions to reach high-performance policies. In turn, they require, as shown in the experimental 
results, significantly fewer offline policy evaluations than other DPS techniques. Third, we show that OLT policies 
are much more straightforward to train than usual DPS policies, which often require cumbersome trial-and-error 
iterations to choose and train their approximation structure. Finally, we show that OLT policies are quite 
robust with respect to perturbations of the initial conditions. 

The content of the paper is structured as follows. In Section 2 we begin by introducing basic notation and 
formally stating the type of sequential decision-making problems this paper is about. Section 3 presents a detailed 
description of optimized look-ahead trees. The node scoring function of optimized look-ahead trees can be 
learned using any derivative-free global optimizer. One such optimizer which was shown to be highly relevant 
to DPS is Gaussian process optimization [17, 29]. We give a brief (but for all practical purposes fully sufficient) 
summary of this approach in Section 4. Section 5 presents the results of extensive experimental evaluations, 
wherein we compare optimized look-ahead trees with different pure DPS and pure LT techniques on a number of 
well-known benchmark domains. We present related work in Section 6 and finally conclude in Section 7. 

2 Problem statement 

This section introduces basic notation and formally states the problem we are going to solve. We consider a 
deterministic discrete-time system whose dynamics are given by $x_{t+1} = f(x_t, a_t)$, where $x$ is an element of 
some state space $X$ and $a$ is an element of some action space $A$. We do not make any assumptions about the 
form of the state space, since the presented approach can deal with any form of $X$. We do, however, make strong 
assumptions about the action space $A$: we require that it is finite, i.e. $|A| = K$, and, for computational reasons, 
that the number of actions is small. In the examples we will consider later, $X$ will be continuous and vector-valued, 
i.e., $X \subset \mathbb{R}^{n_x}$, and each discrete action will be mapped to a specific continuous control vector (bang-bang 
control). 

The system evolves at discrete time steps with $t = 1, 2, \dots$; for every transition we make, we observe a scalar 
reward $g(x_t, a_t)$ which serves as a performance measure; we denote by $\underline{B} = \inf\{g(x, a) : (x, a) \in X \times A\}$ 
and $\overline{B} = \sup\{g(x, a) : (x, a) \in X \times A\}$ the lower and upper bounds on the reward, without however assuming that they are finite. Let $\pi : X \to A$ 
denote a stationary policy, i.e. a mapping from states to actions. For any given policy $\pi$ and state $x_0$, the infinite 
horizon discounted sum of rewards $V^\pi(x_0)$ is defined as 

$$V^\pi(x_0) := \lim_{T \to \infty} \sum_{t=0}^{T} \gamma^t g(x_t, \pi(x_t)), \tag{1}$$

where $x_{t+1} = f(x_t, \pi(x_t))$ and where $0 < \gamma < 1$ is a discount factor. The optimal value function is defined as the maximum over all policies, 
$V^*(x_0) := \max_\pi V^\pi(x_0)$, and satisfies the discrete-time Hamilton-Jacobi-Bellman (HJB) equation 

$$V^*(x) = \max_a \left[ g(x, a) + \gamma V^*(f(x, a)) \right] \quad \forall x \in X. \tag{2}$$



If we are able to determine the solution $V^*$ of Eq. (2), the optimal action for every state $x$ can be derived as 
$\pi^*(x) = \arg\max_a \left[ g(x, a) + \gamma V^*(f(x, a)) \right]$. However, solving Eq. (2) exactly is possible only in some cases 
(e.g., LQR); in the general nonlinear case and for higher dimensional spaces this presents an open research 
problem. 

DPS approaches do not try to estimate $V^\pi$ or $V^*$. Instead, they solve the conceptually simpler problem of 
finding a policy that "works well" for some initial conditions. In our case, we define this objective as follows: 
given a set of initial states $X_0 \subset X$, our goal is to find a policy that maximizes the performance over the states 
in $X_0$, i.e., we want to find 

$$\arg\max_{\pi} \sum_{x_0 \in X_0} V^\pi(x_0) \tag{3}$$

(or $\arg\max_{\pi} \sum_{x_0 \in X_0} p(x_0) V^\pi(x_0)$, where $p(x_0) > 0$ are some weights). 

In order to solve this problem, one assumes that policies are functions $\pi_\theta := \pi(\cdot\,; \theta)$ parameterized by some 
vector $\theta \in \mathbb{R}^d$. The optimization over policies in Eq. (3) is thus turned into an optimization over real-valued 
vectors 

$$\arg\max_{\theta \in \mathbb{R}^d} V_{X_0}(\theta) := \sum_{x_0 \in X_0} V^{\pi_\theta}(x_0). \tag{4}$$

Given a vector $\theta$, the objective function $V_{X_0}$ can be evaluated by simulating the system under the policy $\pi_\theta$ from 
all the states in $X_0$ and summing the rewards as in Eq. (1). Note that, to avoid the infinite sum, we have to 
truncate the infinite horizon to a finite (typically large) number of simulation steps $H$. The objective function in 
Eq. (4) is thus effectively replaced by 

$$\arg\max_{\theta \in \mathbb{R}^d} V_{X_0}(\theta) := \sum_{x_0 \in X_0} \sum_{t=0}^{H} \gamma^t g(x_t, \pi(x_t; \theta)), \quad \text{where } \forall t: x_{t+1} = f(x_t, \pi(x_t; \theta)). \tag{5}$$

The novelty of this paper is that we consider policies $\pi(\cdot\,; \theta)$ which, instead of being defined through function 
approximators, are non-trivial algorithms that at some point depend on $\theta$ and on the generative model $(f, g)$. To 
emphasize this requirement, we will write $\pi_{f,g}(\cdot\,; \theta)$ whenever we refer explicitly to an OLT policy. 
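The truncated objective of Eq. (5) can be sketched as a plain simulation loop. The snippet below is a minimal illustration, not the authors' implementation: the function names and the one-dimensional toy system (`toy_f`, `toy_g`, `toy_policy`) are hypothetical and only serve to make the example runnable.

```python
from typing import Callable, Iterable


def truncated_return(f: Callable, g: Callable, policy: Callable,
                     x0, theta, gamma: float, H: int) -> float:
    """Sum_{t=0}^{H} gamma^t * g(x_t, pi(x_t; theta)) with x_{t+1} = f(x_t, a_t)."""
    x, total, discount = x0, 0.0, 1.0
    for _ in range(H + 1):
        a = policy(x, theta)          # a_t = pi(x_t; theta)
        total += discount * g(x, a)   # accumulate the discounted reward
        x = f(x, a)                   # deterministic transition
        discount *= gamma
    return total


def objective(f: Callable, g: Callable, policy: Callable,
              X0: Iterable, theta, gamma: float, H: int) -> float:
    """V_{X_0}(theta): truncated returns summed over all initial states."""
    return sum(truncated_return(f, g, policy, x0, theta, gamma, H) for x0 in X0)


# Hypothetical toy problem (assumption): integer line, actions {-1, +1},
# reward -|x|, and a one-parameter threshold policy.
def toy_f(x, a):
    return x + a


def toy_g(x, a):
    return -abs(x)


def toy_policy(x, theta):
    return -1 if theta * x > 0 else 1


if __name__ == "__main__":
    print(objective(toy_f, toy_g, toy_policy, [3], 1.0, 0.5, 2))
```

Any policy with the signature `policy(x, theta)` can be plugged in, which is the sense in which the choice of policy class makes no conceptual difference to the optimizer.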



3 Optimized Look-Ahead Trees 

Given a state $x_t$, we now describe a way of selecting action $\pi_{f,g}(x_t; \theta)$ based on the construction of a look-ahead 
tree. The construction of the tree will, in general, be non-uniform and is controlled by a function that is 
parameterized by $\theta$ (non-uniform meaning that leaf nodes are not necessarily at the same depth). 



3.1 Notation 

Let $K$ be the number of possible actions and $\mathcal{T}$ denote the look-ahead tree. $\mathcal{T}$ is composed of nodes $n_{i,h} \in \mathcal{T}$, 
where $h$ denotes the depth and $i$ denotes the index within the depth $h$ (i.e., $i \in \{1, \dots, K^h\}$). The state $x_t$ 
for which we want to find action $\pi_{f,g}(x_t; \theta)$ is placed in the root node $n_{1,0}$. Each node corresponds to a 
particular sequence of actions and states. The children of $n_{i,h}$ are generated by applying one of the $K$ actions: 
assume $a^{(h)} \in \{1, \dots, K\}$ is the index of the action taken at depth $h$, then the child of $n_{i,h}$ is $n_{K(i-1)+a^{(h)},\,h+1}$. 
For each node $n_{i,h}$ there exists one unique path from the root $n_{1,0}$ to $n_{i,h}$: 

$$\text{path from } n_{1,0} \text{ to } n_{i,h}: \quad n_{1,0} \to n_{i_1,1} \to n_{i_2,2} \to \cdots \to n_{i_h,h} = n_{i,h}$$

which corresponds to a particular sequence of actions $a^{(0)}, a^{(1)}, \dots, a^{(h-1)}$ taken at each intermediate step and 
which induces a partial trajectory of length $h$: 

$$x_t \xrightarrow{a^{(0)}} x_{t+1} \xrightarrow{a^{(1)}} x_{t+2} \cdots x_{t+h-1} \xrightarrow{a^{(h-1)}} x_{t+h}$$



2 Note that in the experiments we report on later we will only use a single initial state. In the general case, having multiple initial states 
tends to increase the robustness of the policy. 



The successor states are generated according to the transition function $f$, e.g., $x_{t+h}^{a^{(0)} \cdots a^{(h-1)}} = f\big(x_{t+h-1}^{a^{(0)} \cdots a^{(h-2)}}, a^{(h-1)}\big)$, 
and rewards according to the reward function $g$. For each node we define $\varrho(n_{i,h})$ to be the reward obtained 
in the last step: $\varrho(n_{i,h}) := g\big(x_{t+h-1}^{a^{(0)} \cdots a^{(h-2)}}, a^{(h-1)}\big)$. 

Every time we expand a node, we generate all of its children. A node whose children have been generated is 
called an inner node. Otherwise it is called a terminal node.³ We denote $\mathcal{T} = \mathcal{T}_{\text{inner}} \cup \mathcal{T}_{\text{leaf}}$. 

3.2 Using look-ahead trees to make decisions 

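We first describe how to use the information contained in a look-ahead tree to select an action. For each terminal 
node $n_{i,h} \in \mathcal{T}_{\text{leaf}}$ we define the $\ell$-score as the discounted sum of rewards obtained along the path from the root 
node $n_{1,0}$ to $n_{i,h}$ plus a lower bound on the cumulated rewards not yet observed: 

$$\forall n_{i,h} \in \mathcal{T}_{\text{leaf}}: \quad \ell(n_{i,h}) := \sum_{t=1}^{h} \varrho(n_{i_t,t}) \gamma^{t-1} + \sum_{t=h+1}^{\infty} \underline{B} \gamma^{t-1} = \sum_{t=1}^{h} \varrho(n_{i_t,t}) \gamma^{t-1} + \frac{\underline{B} \gamma^{h}}{1-\gamma}. \tag{6}$$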

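For each non-terminal node $n_{i,h} \in \mathcal{T}_{\text{inner}}$ we define the $\ell$-score recursively as the maximum of the $\ell$-scores of its 
children (see Figure 1): 

$$\forall n_{i,h} \in \mathcal{T}_{\text{inner}}: \quad \ell(n_{i,h}) := \max_{n \in \text{children}(n_{i,h})} \ell(n). \tag{7}$$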

The $\ell$-score of the root node $n_{1,0}$ (which corresponds to state $x_t$) is a lower bound on the optimal value 
$V^*(x_t)$ [23]. Given a look-ahead tree, we adopt the conservative strategy that consists in selecting the action that 
leads to the successor state with maximal lower bound. With the naming scheme for nodes introduced above, we 
can write $\ell(n_{1,0}) = \max_a \ell(n_{a,1})$, and thus 

$$\pi_{f,g}(x_t) = \arg\max_a \ell(n_{a,1}). \tag{8}$$

In the rest of this paper, we generally assume that $\underline{B} = 0$ and that $\overline{B}$ is finite, although neither of these conditions 
is a requirement for our OLT method. 
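For a fully developed (uniform) tree, the $\ell$-score recursion and the conservative action choice of Eq. (8) can be sketched recursively. This is only an illustration under stated assumptions: the function names and the toy system are hypothetical, ties are resolved by taking the first maximizing action, and the tree is enumerated exhaustively rather than grown under a budget.

```python
def l_score_uniform(f, g, x, actions, gamma, depth, b_low=0.0):
    """l-score of a node with state x when a uniform tree of `depth` further
    stages is developed below it (max over children, as in Eq. (7))."""
    if depth == 0:
        # Leaf: nothing observed yet below this node, only the lower-bound tail
        # sum_{t>=1} b_low * gamma^(t-1) = b_low / (1 - gamma).
        return b_low / (1.0 - gamma)
    return max(
        g(x, a) + gamma * l_score_uniform(f, g, f(x, a), actions, gamma, depth - 1, b_low)
        for a in actions
    )


def lt_action(f, g, x, actions, gamma, depth, b_low=0.0):
    """Conservative choice: action whose child has the largest lower bound (depth >= 1)."""
    return max(
        actions,
        key=lambda a: g(x, a)
        + gamma * l_score_uniform(f, g, f(x, a), actions, gamma, depth - 1, b_low),
    )


# Hypothetical toy problem (assumption): integer line, reward 1 only when the
# transition lands on state 3, so rewards lie in [0, 1].
def toy_f(x, a):
    return x + a


def toy_g(x, a):
    return 1.0 if x + a == 3 else 0.0


if __name__ == "__main__":
    print(lt_action(toy_f, toy_g, 0, (-1, 1), 0.9, 3))
```

In this toy problem the reward is only visible at depth 3, so a shallower tree cannot distinguish the two actions; this is the kind of situation in which the tree size drives decision quality.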

3.3 Developing look-ahead trees 

We now discuss the construction of look-ahead trees. If the rewards are upper-bounded ($\overline{B}$ is finite), we can 
also upper bound the optimal values $V^*(x)$. For each terminal node $n_{i,h} \in \mathcal{T}_{\text{leaf}}$ we define the $u$-score as the 
discounted sum of rewards obtained along the path from the root node $n_{1,0}$ to $n_{i,h}$, plus an upper bound on the 
cumulated rewards not yet observed: 

$$\forall n_{i,h} \in \mathcal{T}_{\text{leaf}}: \quad u(n_{i,h}) := \sum_{t=1}^{h} \varrho(n_{i_t,t}) \gamma^{t-1} + \sum_{t=h+1}^{\infty} \overline{B} \gamma^{t-1} = \ell(n_{i,h}) + \frac{(\overline{B} - \underline{B}) \gamma^{h}}{1-\gamma}. \tag{9}$$

For each non-terminal node $n_{i,h} \in \mathcal{T}_{\text{inner}}$ we define the $u$-score recursively as the maximum of the $u$-scores of its 
children (see Figure 1): 

$$\forall n_{i,h} \in \mathcal{T}_{\text{inner}}: \quad u(n_{i,h}) := \max_{n \in \text{children}(n_{i,h})} u(n). \tag{10}$$

See Figure 1 for a graphical illustration of a look-ahead tree with associated $\ell$-scores and $u$-scores. The $u$-score 
of the root node $n_{1,0}$ is an upper bound on the optimal value $V^*(x_t)$. Furthermore, the $\ell$-score of the root 
increases with the number of expanded nodes in the tree, while at the same time its $u$-score decreases [23]. In 
other words, the more nodes the tree has and the farther we develop it into the future, the tighter our bounds on 
the optimal value will become⁴ and thus the better a decision we can hope to make with Eq. (8). (Bear in mind 

3 Note that the following terminology is equivalent: 

non-terminal node = inner node = explored node = closed node 
terminal node = leaf node = unexplored node = open node. 



4 The tightness of the bounds will also depend on $\gamma$. The closer $\gamma$ is to 1, the more nodes we will have to expand. However, there is 
little we can do about it, since $\gamma$ is given in advance. 



[Figure 1 panels: uniform look-ahead tree ($K = 2$ actions); $u$-scores and $\ell$-scores ($\underline{B} = 0$, $\overline{B} = 1$, $\gamma = 0.9$), giving $V^*(x_t) \in [1.22, 9.32]$.]




Figure 1: Left side: a uniform look-ahead tree developed from state $x_t$ for $h = 2$ stages and $K = 2$ 
actions. Nodes are labeled by $n_{i,h}$ and correspond to states; edges correspond to choosing an action. Each 
node encodes the result of taking a particular sequence of actions; e.g., the node $n_{2,2}$ corresponds to state $x_{t+2}^{a_1 a_2}$, 
which is the result of starting from state $x_t$ and choosing action $a_1$ in the first stage and action $a_2$ in the second 
stage. Right side: illustrates how $\ell$-scores and $u$-scores are calculated (which form a lower and an upper bound 
for the true value $V^*(x_t)$ of the root node). The plot shows a non-uniform tree where the terminal node with 
the highest $u$-score is expanded first. Edges are labeled with the rewards $g$ associated with the transition they 
represent. Note that the $\ell$-scores and $u$-scores are calculated only for terminal nodes and propagated back to their 
parent via the max operator (denoted by the arc). 



that, unlike search trees for constraint satisfaction problems which are grown up to a goal state, our look-ahead 
trees are grown until a predefined computational budget is exhausted.) 
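As a quick check of how fast the bounds tighten, Eqs. (6) and (9) imply that the gap between the two scores of a leaf depends only on its depth. The worked example below uses the values shown in Figure 1 and matches the width of the interval reported there; the tolerance $\varepsilon = 0.1$ in the last line is an illustrative assumption, not a value from the paper.

```latex
% Gap between upper and lower bound for a leaf at depth h
u(n_{i,h}) - \ell(n_{i,h}) = \frac{(\overline{B} - \underline{B})\,\gamma^{h}}{1-\gamma}

% Figure 1 values: \underline{B} = 0, \overline{B} = 1, \gamma = 0.9, h = 2
\frac{(1 - 0)\cdot 0.9^{2}}{1 - 0.9} = \frac{0.81}{0.1} = 8.1 = 9.32 - 1.22

% Depth needed for a gap of at most \varepsilon = 0.1 (illustrative tolerance)
\frac{0.9^{h}}{0.1} \le 0.1
\iff h \ge \frac{\ln 0.01}{\ln 0.9} \approx 43.7
\;\Rightarrow\; h = 44
```

With a uniform tree, reaching that depth would be far beyond any realistic budget, which is what motivates the directed expansion strategies discussed next.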

Knowing that, the big question is: how should we develop the tree such that we arrive at a near-optimal 
decision within the allowed computational budget? 

The typical way of developing a look-ahead tree is to build a uniform tree, by expanding nodes with a breadth-first 
strategy. However, to fully develop a tree of depth $h$ we need $K^0 + K^1 + K^2 + \dots + K^h$ node expansions, 
which for growing $h$ will rapidly become computationally infeasible. 
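The growth of this count is easy to make concrete. The helper below is a small sketch with hypothetical names; it evaluates the sum directly and compares it with the standard geometric-series closed form, which assumes $K > 1$.

```python
def uniform_tree_expansions(K: int, h: int) -> int:
    """K^0 + K^1 + ... + K^h, the count stated in the text for depth h."""
    return sum(K ** d for d in range(h + 1))


def geometric_closed_form(K: int, h: int) -> int:
    """Closed form (K^(h+1) - 1) / (K - 1) of the same sum, valid for K > 1."""
    return (K ** (h + 1) - 1) // (K - 1)


if __name__ == "__main__":
    for K in (2, 3):
        print(K, [uniform_tree_expansions(K, h) for h in (2, 5, 10)])
```

Even with only two or three actions, the count grows by roughly a factor of $K$ per additional stage.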

Best-first search is the common solution to this dilemma; it develops trees non-uniformly and only expands 
those nodes to a deeper depth that look "promising". In order to do this, the algorithm relies on a node scoring 
function $e : \mathcal{T}_{\text{leaf}} \to \mathbb{R}$ that is used to assign a certain score to every terminal node $n_{i,h} \in \mathcal{T}_{\text{leaf}}$; whenever we want 
to decide which terminal node to expand next, we compare all the scores and choose the node with the highest 
score. Note that the default uniform development strategy can be obtained as a particular case of best-first search 
with the following node scoring function: 

$$e^{\text{uniform}}(n_{i,h}) := -h. \tag{11}$$

The work in [23] suggests using the $u$-scores of the terminal nodes as the scoring function (which makes the 
method closely related to A*): 

$$e^{\text{optimistic}}(n_{i,h}) := u(n_{i,h}). \tag{12}$$

The authors of [23] show theoretically that non-uniform trees developed by the $u$-score will never perform worse 
than uniform trees for the same budget of expansions, but certain conditions must be met to make them perform 
better. The extent to which these conditions are fulfilled is problem-specific and depends on how "informative" 
the reward is (which in turn depends on the coupling of dynamics and rewards). Informally speaking, a problem 



will be difficult (i.e., the rewards will be non-informative) if there are many nodes with the property that, given 
the observed rewards from the root to the node in question, one cannot decide whether the node lies on an optimal 
path or not. This is, for example, the case for reinforcement learning problems with a flat reward structure, such 
as in the mountain car domain (where every transition has reward 0 and only entering the goal gives +1). On 
the other hand, having an informative reward is not a totally exotic requirement; for many control problems an 
informative reward comes as the natural definition of performance (e.g., the many pole balancing or inverted 
pendulum domains, where rewards are taken as a quadratic function of angle and angular velocity). 
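The two scoring functions of Eqs. (11) and (12) can be compared on a pair of leaves. The sketch below is illustrative only: the `Leaf` container and the function names are hypothetical, and the default reward bounds 0 and 1 are an assumption matching the values used in Figure 1.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    depth: int          # h
    disc_reward: float  # sum_{t=1}^{h} rho_t * gamma^(t-1) along the path from the root


def l_score(leaf: Leaf, gamma: float, b_low: float = 0.0) -> float:
    """Lower bound of Eq. (6)."""
    return leaf.disc_reward + b_low * gamma ** leaf.depth / (1.0 - gamma)


def u_score(leaf: Leaf, gamma: float, b_low: float = 0.0, b_high: float = 1.0) -> float:
    """Upper bound of Eq. (9): the l-score plus the depth-dependent gap."""
    return l_score(leaf, gamma, b_low) + (b_high - b_low) * gamma ** leaf.depth / (1.0 - gamma)


def e_uniform(leaf: Leaf, gamma: float) -> float:
    return -leaf.depth              # shallowest leaf first (breadth-first)


def e_optimistic(leaf: Leaf, gamma: float) -> float:
    return u_score(leaf, gamma)     # leaf with the highest upper bound first


def next_to_expand(leaves, score, gamma):
    """Best-first rule: expand the leaf with the highest score."""
    return max(leaves, key=lambda leaf: score(leaf, gamma))


if __name__ == "__main__":
    shallow, deep = Leaf(1, 0.0), Leaf(2, 1.5)
    print(next_to_expand([shallow, deep], e_uniform, 0.9))
    print(next_to_expand([shallow, deep], e_optimistic, 0.9))
```

With an informative reward, the optimistic rule keeps following the path that has already paid off, whereas the uniform rule ignores rewards entirely.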

3.4 Parameterizing the development of the look-ahead tree 

It turns out that a large number of strategies could be used to develop look-ahead trees and it is probably the 
case that there exists no single best strategy for all problems. Instead of searching for the best possible generic 
strategy, we adopt the approach first introduced in [32], which consists in learning a specific look-ahead node 
development strategy in a problem-driven way. 

Optimized look-ahead trees rely on a parameterized node scoring function $e(\cdot\,; \theta) : \mathcal{T}_{\text{leaf}} \to \mathbb{R}$, where $\theta \in \mathbb{R}^d$ 
is a parameter vector. This function should be flexible enough to represent a large variety of tree development 
strategies. In particular the parameters should encode important aspects of search, such as the extent to which 
depth-first search is preferred over breadth-first search, or how short-term rewards should be used to bias search. 
Furthermore, since different search strategies may be optimal in different regions of the state space, we would like 
$e(\cdot\,; \theta)$ to enable state-dependent strategies. To meet these requirements, as in [32], we take the parameterized 
node scoring function to be a simple weighted sum of features extracted from the information encoded in the 
path from the root node to the node in question. 

In our investigations, the node scoring function is defined for each terminal node $n_{i,h} \in \mathcal{T}_{\text{leaf}}$ in the following 
way: let $x_{t+h} = (x_{t+h}^{(1)}, \dots, x_{t+h}^{(n_x)}) \in \mathbb{R}^{n_x}$ denote the $n_x$-dimensional state corresponding to $n_{i,h}$, and 
$\varrho_{t+h} = \varrho(n_{i,h})$ denote its reward. We consider three blocks of features: the first $n_x$ features merely correspond 
to the components of the state, the next $n_x$ features correspond to components of the state multiplied by the reward 
(enabling the tree to be grown in a more or less directed way), and the final set of $n_x$ features corresponds to the 
components of the state multiplied by the depth $h$ (making it possible to control the breadth/depth trade-off). Let $\theta \in \mathbb{R}^{3 n_x}$ be 
the vector of parameters. The parameterized scoring function $e(\cdot\,; \theta)$ can then be written as 

$$\forall n_{i,h} \in \mathcal{T}_{\text{leaf}}: \quad e(n_{i,h}; \theta) := \sum_{j=1}^{n_x} x_{t+h}^{(j)} \left( \theta_j + \theta_{n_x+j} \cdot \varrho_{t+h} + \theta_{2 n_x+j} \cdot h \right). \tag{13}$$

Notice that with this linear parameterization the outcome of which node is selected is invariant under scaling of 
the parameter vector by positive scalars. Of course, other kinds of features are also possible and may in fact turn 
out to be more suitable in some cases (this may be an avenue for future research). 
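Eq. (13) translates into a few lines of code. The sketch below is one possible reading, with a hypothetical function name; it assumes the state is given as a sequence of $n_x$ numbers and that $\theta$ is laid out as three consecutive blocks of length $n_x$ (state, state times reward, state times depth).

```python
from typing import Sequence


def node_score(x: Sequence[float], reward: float, depth: int,
               theta: Sequence[float]) -> float:
    """Weighted sum of the 3 * n_x features of Eq. (13) for one terminal node."""
    n_x = len(x)
    if len(theta) != 3 * n_x:
        raise ValueError("theta must have 3 * n_x components")
    return sum(
        x[j] * (theta[j]                        # block 1: state component
                + theta[n_x + j] * reward       # block 2: state * last reward
                + theta[2 * n_x + j] * depth)   # block 3: state * depth
        for j in range(n_x)
    )


if __name__ == "__main__":
    theta = [1.0, 0.0, 0.0, 2.0, 0.0, 1.0]
    print(node_score([1.0, 2.0], reward=0.5, depth=3, theta=theta))
```

Because the score is linear in $\theta$, multiplying $\theta$ by a positive constant multiplies every score by the same constant and therefore leaves the expansion order unchanged, which is the invariance noted above.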

3.5 Connection with the optimal value function 

If, in place of the scoring function $e(\cdot\,; \theta)$, we had used the optimal value function $V^*$ to decide which node to 
expand next, the algorithm in Eq. (8) would have returned the optimal action for any number of node expansions 
$> 1$. In this sense, the scoring function $e(\cdot\,; \theta)$ can be seen as a weaker form of value function. It is weaker because 
it does not have to assign the precise quantity "discounted future sum of rewards"; any number will do as long as 
it helps to grow the tree in roughly the right direction. 

It is for this reason that we believe it could be advantageous to try to learn a good scoring function instead 
of trying to directly learn/approximate the optimal value function: the former can be a rather simple function (as 
evidenced by the good results we get in Section 5, where we use for all domains exactly the same simple weighted 
sum of features), whereas the latter would require a far more expressive parametrization (and it is well-known 
that value function approximation scales badly when the dimensionality of the state space grows). 

3.6 Summary: the algorithm 

Figure 2 presents a simple algorithm based on a sorted list to implement policies as parameterized look-ahead 
trees. The algorithm requires as input a state $x_t$ and the parameter vector $\theta$ and returns the action $\pi_{f,g}(x_t; \theta)$. It 



Input: state x_t, weights θ 
Output: policy action π_{f,g}(x_t; θ) 

Depends on: 
    f           transition function 
    g           reward function 
    γ           discount factor 
    e(· ; θ)    parameterized score function 
    h_max       maximum number of node expansions 

1. Initialize 
    list := ∅ 
    for i = 1 ... K                /* Expand root node for all actions */ 
        x' := f(x_t, a_i); r' := g(x_t, a_i); h' := 1; e := e(x', r', h'; θ) 
        generate node n: 
            n.x := x'; n.l := r'; n.h := h'; n.e := e; n.π := a_i 
        add n to list 

2. Main loop 
    while number of node expansions j < h_max (the max number of node expansions) 
        find n* := argmax_{n ∈ list} n.e and remove n* from list 
        for i = 1 ... K            /* Expand node n* for all actions */ 
            x' := f(n*.x, a_i); r' := g(n*.x, a_i); h' := n*.h + 1; e := e(x', r', h'; θ) 
            generate node n: 
                n.x := x'; n.l := n*.l + γ^{h'-1} r'; n.h := h'; n.e := e; n.π := n*.π 
            add n to list 

3. Get best action 
    return π_{f,g}(x_t; θ) := n*.π where n* = argmax_{n ∈ list} n.l    /* ties broken randomly */ 



Figure 2: Implementing a policy represented by a parameterized look-ahead tree. 

depends on the allowed number of node expansions $h_{\max}$, the domain (represented by the generative model $(f, g)$) 
and the discount factor $\gamma$. The computational complexity for evaluating $\pi_{f,g}(x_t; \theta)$ for a budget of $h_{\max}$ node 
expansions is obtained as follows. Let $D$ be the cost for one call to the generative model and $E$ be the cost 
to recursively update the internal scores of a node. We assume that the leaf nodes are stored in an appropriate 
data structure which allows incremental insertion in logarithmic time and allows finding the maximum of the 
scores in constant time (such that for every iteration of the main loop, if the list of terminal nodes contains $N$ 
elements, we can find the maximum score and update the structure with $\log N$ operations). Note that for every 
iteration $j = 1, 2, \dots$ of the main loop, our list of terminal nodes contains $(K-1)(j-1) + K$ elements, which 
can be shown by simple induction (every time we expand a node we remove one element from the list and add 
$K$ new ones). At each iteration $j$ of the main loop we do the following: we first find the best node to expand 
among the current terminal nodes. We then generate all of its $K$ successors, where each one costs $D$ 
for having to call the generative model and $E$ to recursively update its internal scores. After the main loop has 
run for $h_{\max}$ iterations, we have to find the best $\ell$-score from the list. The computational complexity of the algorithm 
is thus 

$$\sum_{j=1}^{h_{\max}} \left[ K(D+E) + \log\{(K-1)(j-1)+K\} \right] + (K-1)(h_{\max}-1) + K \le h_{\max} \left( K(D+E) + \log\{(K-1)(h_{\max}-1)+K\} + K - 1 \right) + 1. \tag{14}$$

In the particular case where we expand with the uniform heuristic (i.e., breadth-first), we skip the maximization 
within each iteration of the main loop. The computational complexity then becomes 

$$\sum_{j=1}^{h_{\max}} \left[ K(D+E) \right] + (K-1)(h_{\max}-1) + K = h_{\max} \left( K(D+E) + K - 1 \right) + 1. \tag{15}$$



Node n is a struct consisting of fields: 
    n.x    state 
    n.h    depth 
    n.l    ℓ-score = cumulative reward 
    n.e    tree development score from e(· ; θ) 
    n.π    first action on path 



Note that in order to expand uniformly, $h_{\max}$ must be equal to one of $K^0$, $K^0 + K^1$, ..., $K^0 + K^1 + \dots + K^d$, where 
$d$ is the depth of the tree. 
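The algorithm of Figure 2 can be sketched with a binary heap in place of the sorted list. This is a minimal illustration rather than the authors' code: the function name and the toy system are hypothetical, the root expansion is assumed not to count towards `h_max`, ties in the final step go to the first maximal leaf instead of being broken randomly, and $\underline{B} = 0$ is assumed so that the $\ell$-score is just the cumulative discounted reward.

```python
import heapq
import itertools


def olt_policy(x_t, theta, f, g, actions, gamma, score, h_max):
    """Grow a look-ahead tree best-first under score(x, r, h, theta) and return
    the first action on the path to the leaf with the highest l-score."""
    tie = itertools.count()   # unique tie-breaker so the heap never compares states
    heap = []                 # entries: (-e, tie, state, l_score, depth, first_action)

    def add_child(x, a, l_parent, h, first_action):
        x_new, r = f(x, a), g(x, a)
        l_new = l_parent + gamma ** (h - 1) * r   # reward at depth h weighted by gamma^(h-1)
        e = score(x_new, r, h, theta)
        heapq.heappush(heap, (-e, next(tie), x_new, l_new, h, first_action))

    for a in actions:                             # 1. expand the root for all actions
        add_child(x_t, a, 0.0, 1, a)

    for _ in range(h_max):                        # 2. main loop: h_max node expansions
        _, _, x, l, h, first_action = heapq.heappop(heap)   # leaf with the highest score e
        for a in actions:
            add_child(x, a, l, h + 1, first_action)

    best = max(heap, key=lambda entry: entry[3])  # 3. leaf with the best l-score
    return best[5]


# Hypothetical toy problem (assumption): integer line, reward 1 when landing on 3.
def toy_f(x, a):
    return x + a


def toy_g(x, a):
    return 1.0 if x + a == 3 else 0.0


if __name__ == "__main__":
    uniform = lambda x, r, h, theta: -h
    directed = lambda x, r, h, theta: theta[0] * x
    print(olt_policy(0, None, toy_f, toy_g, (-1, 1), 0.9, uniform, h_max=6))
    print(olt_policy(0, [1.0], toy_f, toy_g, (-1, 1), 0.9, directed, h_max=2))
```

In this toy problem the uniform score needs six expansions before the rewarding leaf appears, whereas a score that prefers larger states finds it after two, which illustrates on a very small scale why a well-chosen $\theta$ can shrink the required tree.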

4 Gaussian Process Optimization 

We now turn to the problem of solving Eq. (5), i.e., finding a vector $\theta$ such that the induced policy $\pi(\cdot\,; \theta)$ (globally) 
maximizes the score $V_{X_0}(\theta)$ over the set of initial states $X_0$. In general, this will be a difficult optimization 
problem. First, the dependence of $V_{X_0}$ on $\theta$ can be a complex one with local extrema occurring frequently; 
thus we may need to sample the search space exhaustively. Second, there is no closed-form expression for 
evaluating the objective function $V_{X_0}$ or its gradient; instead we have to simulate the system (or run real-world 
experiments), which is expensive. Among the many alternatives that have been proposed in the past for this 
purpose, such as cross-entropy [44], various stochastic search alternatives [20], or Lipschitzian optimization 
[39], Gaussian process optimization (GPO) is considered to be one of the most efficient methods to optimize 
expensive functions. 

The purpose of this section is to provide the required background in GPO to readers who are not familiar 
with it. Note that in our experiments we will illustrate the optimization of look-ahead trees with both GPO and 
the cross-entropy method, the latter being much simpler to understand and to implement than the former. Since 
choosing a policy representation and optimizing its parameters are two separate tasks, we keep the description of 
GPO general and independent of the specific nature of the OLT policies. Readers less interested in the optimization 
part may safely skip this section and directly jump to Section 5. 
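For readers who prefer the simpler alternative, the following is a minimal sketch of a cross-entropy search over a parameter vector. It is a generic textbook variant with a diagonal Gaussian sampling distribution, not the configuration used in the experiments: the function name, population size, elite size, iteration count, and the small floor added to the standard deviation are all assumptions.

```python
import random
import statistics


def cross_entropy_search(objective, dim, n_iter=20, pop=50, n_elite=10, seed=0):
    """Maximize objective(theta) by repeatedly refitting a diagonal Gaussian
    to the best-scoring samples of each generation."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [1.0] * dim
    best_theta, best_val = None, float("-inf")
    for _ in range(n_iter):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        scored = sorted(((objective(th), th) for th in samples),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_val:                # remember the best sample seen so far
            best_val, best_theta = scored[0]
        elite = [th for _, th in scored[:n_elite]]
        # Refit the sampling distribution to the elite set, one coordinate at a time.
        mean = [statistics.fmean(col) for col in zip(*elite)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*elite)]
    return best_theta, best_val


if __name__ == "__main__":
    # Stand-in objective (assumption); in the OLT setting it would be V_{X_0}(theta).
    theta, value = cross_entropy_search(lambda th: -sum((t - 1.0) ** 2 for t in th), dim=2)
    print(theta, value)
```

Each call to `objective` corresponds to one full policy evaluation by simulation, so the number of evaluations is `n_iter * pop`; this cost is what the surrogate-based approach described in this section aims to reduce.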

4.1 Notation 

To enhance the readability of this section and to conform with the standard notation used in the literature, we 
will define a local notation for this section which overlaps with what we use in the remainder of the article. 
Specifically, we will now write $f(x)$ for the objective function we want to maximize (instead of $V_{X_0}(\theta)$) and $x$ 
for its input arguments (instead of $\theta$). 

4.2 Overview 

GPO is an iterative sampling-based search procedure which constructs a surrogate for the objective function and 
optimizes that in place of the original one. GPO is able to incorporate prior knowledge about the problem and 
provides a principled way to build and exploit the surrogate so as to trade off exploration and exploitation of the 
search space. 

Each iteration of GPO consists of two steps. Assume that at iteration $n$ we have already sampled the objective
function at locations $x_1, \ldots, x_n$ with values $f(x_1), \ldots, f(x_n)$. The first step is to fit a regression model using
Gaussian process regression to the samples gathered so far, that is, to fit a regression model to the training data
$\mathcal{D}_n := \{(x_i, f(x_i))\}_{i=1}^{n}$. Let us call the resulting model $GP_n$. The second step is to use the regression model
$GP_n$ as input to a scoring heuristic $U$ (the so-called acquisition function) to find the most promising point $x_{n+1}$
at which to evaluate the objective function next. Typically, acquisition functions are defined in an optimistic way
such that high scores correspond to potentially high values of the objective function.

The regression model $GP_n$ acts as a surrogate of the objective function: it can be evaluated at any given
point $x$ of the search space to produce an estimate for $f(x)$ (more precisely, $GP_n$ will produce a distribution over
$f(x)$). The reason for building the surrogate is that, unlike the true objective function, it is computationally very
cheap to evaluate (it can be done analytically and in closed form). Thus we can afford sampling it as exhaustively
as necessary to find the maximum. On the other hand, since the values the surrogate produces will only be
estimates, we can never be sure that what we get from maximizing the surrogate is indeed the maximum of the
objective function or even close to it. Instead, we have to take into account how accurate the estimates produced
by $GP_n$ are; this in turn will depend on the general smoothness of the objective function and the number of data
points we have collected in the neighborhood of $x$.

Each time $GP_n$ is evaluated at a location $x$, it outputs two values $\mu_n(x)$ and $\sigma_n^2(x)$ which together define the
Gaussian predictive distribution $\mathcal{N}(\mu_n(x), \sigma_n^2(x))$ over values $f(x)$. The mean of this distribution, $\mu_n(x)$, can
be directly taken as point estimate for $f(x)$. The variance of this distribution, $\sigma_n^2(x)$, can be taken as a measure



Input: objective function $V_{X_0}(\cdot)$ taking arguments $\theta \in \mathbb{R}^d$ (from Eq. (5))
Output: $\theta^* \approx \arg\max_\theta V_{X_0}(\theta)$

Depends on:
    Acquisition function $U$ that takes as functional input a GP model and as argument a vector $\theta \in \mathbb{R}^d$

1. Initialize
    Generate initial sample locations $\theta_1, \ldots, \theta_n$ by random sampling or space-filling methods
    (e.g., Latin hypercube sampling)
    Evaluate the objective function at $\theta_1, \ldots, \theta_n$ to produce training data $\mathcal{D}_n := \{(\theta_i, V_{X_0}(\theta_i))\}_{i=1}^{n}$

2. Main loop $i = n, n+1, \ldots$   /* loop until we run out of computational resources */
    Fit GP model to the current training data:
        $GP_i$ = Gaussian process regression on training data $\mathcal{D}_i$
    Get most promising next sample location:
        $\theta_{i+1} := \arg\max_\theta (U\,GP_i)(\theta)$   /* using e.g. DIRECT */
    Evaluate objective function $V_{X_0}$ at $\theta_{i+1}$, add result to training data, and repeat:
        $\mathcal{D}_{i+1} = \mathcal{D}_i \cup \{(\theta_{i+1}, V_{X_0}(\theta_{i+1}))\}$

3. Return
    Training point in $\mathcal{D}_n$ with highest score:
        $\arg\max_{\theta \in \mathcal{D}_n} V_{X_0}(\theta)$



Figure 3: Implementing Gaussian process optimization. 

of how certain the GP is about this estimate. Both $\mu_n(x)$ and $\sigma_n^2(x)$ will be used by the acquisition function $U$ to
assign a score to $x$, which we will write as $(U\,GP_n)(x)$. To determine the next sample location $x_{n+1}$, we thus
have to determine

$$x_{n+1} := \arg\max_{x}\, (U\,GP_n)(x) \qquad (16)$$

which, unlike Eq. (5), can be solved efficiently by any black-box global optimization method (in our experimental
studies we will use DIRECT [39]).

In the following two sections we will describe each of these steps in more detail; a summary of the algorithm
is also given as pseudo-code in Figure 3.
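
The sample-fit-acquire loop summarized in Figure 3 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: `fit_model` and `acquisition` are hypothetical callables standing in for Gaussian process regression and for the acquisition function $U$, and the acquisition function is maximized over uniformly drawn random candidates, a simple stand-in for DIRECT.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Sample = Tuple[Point, float]


def gpo_loop(
    objective: Callable[[Point], float],
    fit_model: Callable[[List[Sample]], object],         # hypothetical: data -> surrogate
    acquisition: Callable[[object, Point, float], float],  # hypothetical: (model, x, best y) -> score
    bounds: Sequence[Tuple[float, float]],
    n_init: int = 5,
    n_iter: int = 20,
    n_candidates: int = 500,
    seed: int = 0,
) -> Sample:
    """Generic surrogate-based optimization loop (maximization)."""
    rng = random.Random(seed)

    def sample() -> Point:
        return tuple(rng.uniform(lo, hi) for lo, hi in bounds)

    # 1. Initialize: evaluate the expensive objective at random locations.
    data: List[Sample] = []
    for _ in range(n_init):
        x = sample()
        data.append((x, objective(x)))

    # 2. Main loop: fit the surrogate, maximize the acquisition, evaluate.
    for _ in range(n_iter):
        model = fit_model(data)
        best_y = max(y for _, y in data)
        # Assumption: random candidate search replaces DIRECT here.
        candidates = [sample() for _ in range(n_candidates)]
        x_next = max(candidates, key=lambda x: acquisition(model, x, best_y))
        data.append((x_next, objective(x_next)))

    # 3. Return the training point with the highest observed score.
    return max(data, key=lambda d: d[1])
```

The expensive objective is called exactly `n_init + n_iter` times; all remaining work is done on the cheap surrogate, which is the point of the method.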

4.3 Using Gaussian processes for regression 

Here we briefly review how GPs can be used for function estimation [42]. Note that we will present GPs for the
more general case of a stochastic objective function; while the problems we consider later are all deterministic,
this leaves the door open for stochastic returns in future work.

Suppose we are looking for a function $f : \mathcal{X} \subset \mathbb{R}^d \to \mathbb{R}$ from which we have observed noisy samples
$(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathcal{X}$ is the input and $y_i = f(x_i) + \varepsilon_i$ the output corrupted by independent
zero-mean Gaussian noise with common variance $\sigma_0^2$, i.e., $\varepsilon_i \sim \mathcal{N}(0, \sigma_0^2)$. To estimate the function value $f(x)$
at any given input location $x$, we proceed as follows. We suppose that the sought function is a realization of
a zero-mean$^5$ Gaussian process with covariance function $k_\vartheta(x, x')$, which we write as $f \sim \mathcal{GP}(0, k_\vartheta(x, x'))$,
where $\vartheta$ is a vector of hyper-parameters (as explained below). Hence, the vector of function values at the $n$
observed input locations is assumed to be drawn from a joint Gaussian distribution

$$(f(x_1), \ldots, f(x_n)) \mid X, \vartheta \sim \mathcal{N}(0_{n \times 1}, K_\vartheta), \qquad (17)$$

where $X := [x_1, \ldots, x_n]$ and $K_\vartheta$ is the $n \times n$ covariance matrix with entries $[K_\vartheta]_{ij} = k_\vartheta(x_i, x_j)$. The covariance
function $k_\vartheta(\cdot, \cdot)$ can be thought of as a way to encode our prior information about the "smoothness" of the
functions $f$ we believe to come up; typically $k_\vartheta(x, x')$ is chosen as a function of the distance $\|x - x'\|$ in which



5 The assumption of zero mean is made here for notational convenience. (Centering the data, combined with an ergodicity assumption,
makes it possible to avoid a non-zero mean.)



case $k_\vartheta$ measures the expected amount of variation of the function values $f(x)$, $f(x')$ in terms of the distance
between the locations $x$ and $x'$.

A typical choice for $k_\vartheta$ is the translation-invariant isotropic squared exponential, which is of the form

$$k_\vartheta(x, x') := v_0 \exp\{-0.5\,(x - x')^T\, \Omega\, (x - x')\} \qquad (18)$$

and which itself is parameterized by hyper-parameters $v_0 > 0$ and $\Omega = \mathrm{diag}(a_1, \ldots, a_d)$, $a_i > 0$. All hyper-parameters
specifying the GP are thus collected in the vector $\vartheta := (v_0, a_1, \ldots, a_d)$ which together with $\sigma_0^2$
characterizes the prior distribution of our noisy samples. The actual "training" of a GP thus consists of finding a
good pair $(\vartheta, \sigma_0^2)$ from the data, for which various alternative procedures exist in the literature; here we stick to
the most common one, which is the optimization of the marginal likelihood. See [42] for a detailed description.

Now suppose that we know $\vartheta$ and $\sigma_0^2$. Since the noise is zero-mean Gaussian, white, and independent of the
function $f$, it follows from Eq. (17) that the $n \times 1$ vector of observed outputs $Y := [y_1, \ldots, y_n]$ will also be
jointly Gaussian and distributed as follows

$$Y \mid X, \vartheta, \sigma_0^2 \sim \mathcal{N}(0_{n \times 1}, K_\vartheta + \sigma_0^2 I_{n \times n}). \qquad (19)$$



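Furthermore, the joint distribution of the function value $f(x)$ at query location $x$ and the vector of observed
values $Y$ is also Gaussian and characterized in the following way:

$$\begin{bmatrix} Y \\ f(x) \end{bmatrix} \Bigm| X, \vartheta, \sigma_0^2, x \sim \mathcal{N}\left(0_{(n+1) \times 1},\; \begin{bmatrix} K_\vartheta + \sigma_0^2 I_{n \times n} & k_\vartheta(x) \\ k_\vartheta(x)^T & \kappa_\vartheta \end{bmatrix}\right) \qquad (20)$$

where the $n \times 1$ vector $k_\vartheta(x)$ is defined by $k_\vartheta(x) := [k_\vartheta(x, x_1), \ldots, k_\vartheta(x, x_n)]^T$ and the scalar $\kappa_\vartheta$ by
$\kappa_\vartheta := k_\vartheta(x, x)$. Conditioning $f(x)$ on $Y$, we thus obtain

$$f(x) \mid X, \vartheta, \sigma_0^2, x, Y \sim \mathcal{N}(\mu(x), \sigma^2(x)) \qquad (21)$$

where

$$\mu(x) := k_\vartheta(x)^T \left(K_\vartheta + \sigma_0^2 I_{n \times n}\right)^{-1} Y \qquad (22)$$

$$\sigma^2(x) := \kappa_\vartheta - k_\vartheta(x)^T \left(K_\vartheta + \sigma_0^2 I_{n \times n}\right)^{-1} k_\vartheta(x). \qquad (23)$$

Thus, given (noisy) observations $(x_1, y_1), \ldots, (x_n, y_n)$ from an unknown function $f$, with GP regression we first
infer from the observations the hyper-parameters $\vartheta$ and $\sigma_0^2$, and then obtain for any new point $x$ the distribution over
function values $p(f(x) \mid X, x, Y) = \mathcal{N}(\mu(x), \sigma^2(x))$.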
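
As an illustration of the predictive equations, the following sketch computes the posterior mean and variance of Eqs. (22) and (23) with the squared exponential kernel of Eq. (18), using only the Python standard library. It assumes that the hyper-parameters (here named `v0`, `a` and `noise_var`, names chosen for this example) are already known, so the marginal likelihood optimization is not shown, and it solves the linear systems by plain Gaussian elimination, which is adequate only for small $n$.

```python
import math
from typing import List, Sequence, Tuple


def sq_exp_kernel(x: Sequence[float], xp: Sequence[float],
                  v0: float, a: Sequence[float]) -> float:
    """Squared exponential kernel with Omega = diag(a), as in Eq. (18)."""
    return v0 * math.exp(-0.5 * sum(ai * (xi - xpi) ** 2
                                    for ai, xi, xpi in zip(a, x, xp)))


def solve(A: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def gp_posterior(X: Sequence[Sequence[float]], Y: Sequence[float],
                 x: Sequence[float], v0: float, a: Sequence[float],
                 noise_var: float) -> Tuple[float, float]:
    """Return (mu(x), sigma^2(x)) following Eqs. (22) and (23)."""
    n = len(X)
    # K + sigma_0^2 I
    K = [[sq_exp_kernel(X[i], X[j], v0, a) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_x = [sq_exp_kernel(x, X[i], v0, a) for i in range(n)]
    alpha = solve(K, Y)    # (K + sigma_0^2 I)^{-1} Y
    beta = solve(K, k_x)   # (K + sigma_0^2 I)^{-1} k(x)
    mean = sum(k * al for k, al in zip(k_x, alpha))
    var = sq_exp_kernel(x, x, v0, a) - sum(k * b for k, b in zip(k_x, beta))
    return mean, max(var, 0.0)  # clip tiny negative round-off
```

With `noise_var = 0` the posterior mean interpolates the observations and the variance vanishes at the sampled locations, while far from all samples the prediction falls back to the prior (mean 0, variance `v0`).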



4.4 Choosing an acquisition function 

Early work [24, 35] suggested taking as acquisition function the probability of improving over the current maximum
$x^+ := \arg\max_{x \in \mathcal{D}_n} f(x)$ within the training data $\mathcal{D}_n$. The resulting PI acquisition function is given by
$P(f(x) > f(x^+) + \xi)$, which we write as

$$(PI\,GP_n)(x) := \Phi_{0,1}\left(\frac{\mu_n(x) - f(x^+) - \xi}{\sigma_n(x)}\right) \qquad (24)$$

where $\mu_n(x)$ is the mean and $\sigma_n^2(x)$ the variance of the predictive distribution as output by $GP_n$ for point $x$ (see
Eq. (22) and Eq. (23)), and $\Phi_{0,1}$ is the standard normal cumulative distribution. The trade-off parameter $\xi \geq 0$
controls the compromise between the strength of the potential improvement and its probability to be realized.

An alternative acquisition function is the expected improvement EI, which can be evaluated analytically [24],
giving

$$(EI\,GP_n)(x) := \sigma_n(x)\,\bigl(Z\,\Phi_{0,1}(Z) + \phi_{0,1}(Z)\bigr) \qquad (25)$$

where $Z := (\mu_n(x) - f(x^+) - \xi)/\sigma_n(x)$, and $\phi_{0,1}$ is the probability density of the standard normal distribution.

Figure 4 illustrates how GPO works and how PI and EI give rise to distinct sampling behavior using the same
GP and data. Which of the acquisition functions works best for a given problem is difficult to say in
general and needs experimentation; in our examples in Section 5 we have found that EI worked best.
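
Both acquisition functions are cheap to evaluate once the predictive mean and standard deviation are available. The sketch below implements Eqs. (24) and (25) with the standard library only; the function names and the handling of the degenerate case of zero predictive standard deviation (taken here as the limiting value) are choices made for this example.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z: float) -> float:
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def probability_of_improvement(mu: float, sigma: float,
                               f_best: float, xi: float = 0.0) -> float:
    """PI of Eq. (24); mu, sigma are the GP predictive mean and std at x."""
    if sigma <= 0.0:  # assumption: limiting value when the GP is certain
        return 1.0 if mu > f_best + xi else 0.0
    return norm_cdf((mu - f_best - xi) / sigma)


def expected_improvement(mu: float, sigma: float,
                         f_best: float, xi: float = 0.0) -> float:
    """EI of Eq. (25); mu, sigma are the GP predictive mean and std at x."""
    if sigma <= 0.0:  # assumption: limiting value when the GP is certain
        return max(mu - f_best - xi, 0.0)
    z = (mu - f_best - xi) / sigma
    return sigma * (z * norm_cdf(z) + norm_pdf(z))
```

Note that for a fixed predictive mean below the current maximum, EI grows with the predictive standard deviation, which is what drives exploration of poorly sampled regions.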
[Figure 4 panels: (a) n = 4 (after 3 function evaluations); (b) n = 5 (after 4 function evaluations); (c) n = 6 (after 5 function evaluations); (d) n = 7 (after 6 function evaluations)]



Figure 4: A concrete example illustrating Gaussian process optimization over four stages n = 4, 5, 6, 7. Each
panel consists of two subplots. Each upper subplot shows the true (unknown) function we want to optimize as a
dashed curve and the locations at which it was previously evaluated as black filled dots. Using these as training
data $\mathcal{D}_n$ in a GP, the black curve denotes the expected value (mean) of the resulting predictive distribution from
Eq. (22) and the shaded area denotes its variance from Eq. (23), both evaluated at all locations $x \in [-5, 5]$.
As one would expect, the predictive variance (and thus the uncertainty about the estimates) is close to zero at
"known" locations (it would be equal to zero if $\sigma_0^2 = 0$) and grows the farther one gets away from them. Each
lower subplot shows the result of evaluating the acquisition function for one of the two possible choices EI or
PI. Note that each of these would lead to a (slightly) different sampling behavior; here we have chosen the next
sample location (marked in the figure by the star-shaped symbol) as the one that maximizes EI.



5 Experiments 

This section reports an extensive evaluation of optimized look-ahead trees. The purpose of the experiments is to
compare this approach to both pure DPS and pure LT approaches and to demonstrate its strengths across a variety
of domains. To this end we have chosen four challenging benchmark problems, which are described in Section
5.1: inverted pendulum, double inverted pendulum, acrobot handstand and HIV drug treatment. We compare our
approach to two of today's standard methods for DPS and to four generic LT policies, all of which are described
in Section 5.2. The comparison is structured as follows:

• In Section 5.3 we compare the OLT approach to both state-of-the-art DPS approaches and LT approaches
and show that OLT policies often significantly outperform these techniques and that they are at least
competitive in all cases.

• In Section 5.4 we study the sample efficiency of OLT learning according to both the number of policy
evaluations and the number of simulated transitions. We show that most of the time OLT requires fewer
samples than its DPS competitors to reach high-performance policies.

• In Section 5.5 we compare how the different policy parameterizations behave w.r.t. their respective
hyper-parameters and show that optimized look-ahead trees require much less trial-and-error effort than the other
studied DPS approaches.

• Finally, in Section 5.6 we study the robustness of LT, standard DPS, and OLT when the initial states of the
testing set differ from the initial states of the training set and show that OLT policies are quite robust w.r.t.
both small and large perturbations of the initial state.

5.1 The benchmark domains 

We consider four well-known challenging benchmark domains that exhibit some common characteristics: a con- 
tinuous vector-valued state space, finite (discretized) actions, deterministic transitions and rewards that to some 
extent are informative about the goal. Each domain corresponds to a real physical process (either mechanical or 
bio-chemical) internally described by a system of nonlinear differential equations, none of which can be solved 
by traditional control methods such as LQR. The transition function for reinforcement learning is obtained by dis- 
cretizing in time and keeping the controls constant; the actions are obtained by discretizing the bounded control 
space. 
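
The construction of the discrete-time transition function can be sketched as follows. This is a hedged illustration, not the integration code used for the benchmarks: the control is held constant over one decision interval `dt` (zero-order hold) while a user-supplied continuous-time `dynamics(x, u)` is integrated, here with a classical fourth-order Runge-Kutta scheme and a number of `substeps` that are both assumptions of this example. The helper `discretize_controls` shows one simple way (an evenly spaced grid) to turn a bounded scalar control range into a finite action set.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Dynamics = Callable[[Sequence[float], float], Sequence[float]]  # returns dx/dt


def make_transition(dynamics: Dynamics, dt: float,
                    substeps: int = 10) -> Callable[[State, float], State]:
    """Build x_{t+1} = f(x_t, u_t) by integrating dynamics with u held constant."""
    h = dt / substeps

    def transition(x: State, u: float) -> State:
        s = list(x)
        for _ in range(substeps):
            # Classical RK4 step (integrator choice is an assumption).
            k1 = dynamics(s, u)
            k2 = dynamics([si + 0.5 * h * k for si, k in zip(s, k1)], u)
            k3 = dynamics([si + 0.5 * h * k for si, k in zip(s, k2)], u)
            k4 = dynamics([si + h * k for si, k in zip(s, k3)], u)
            s = [si + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for si, a, b, c, d in zip(s, k1, k2, k3, k4)]
        return tuple(s)

    return transition


def discretize_controls(lo: float, hi: float, n: int) -> List[float]:
    """Evenly spaced grid of n >= 2 control values in [lo, hi]."""
    return [lo + i * (hi - lo) / (n - 1) for i in range(n)]
```

For instance, `discretize_controls(-5.0, 5.0, 5)` yields the five values -5, -2.5, 0, 2.5 and 5.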

Inverted pendulum Our first domain is the inverted pendulum, a simple toy problem that is widely
used for benchmarking algorithms. The goal is to swing up and stabilize a single-link inverted pendulum
as shown in Figure 5a. As the motor does not provide enough torque to push the pendulum up in one single
rotation, the pendulum first needs to be swung back and forth to gather energy before being pushed up and
balanced. This creates a nonlinear control problem. The state space is 2-dimensional, $x = (x_1, x_2) = (\phi, \dot\phi)$,
with $\phi \in [-\pi, \pi]$ being the angle, and $\dot\phi \in [-10, 10]$ being the angular velocity. The control force is discretized
to $a \in \{-5, -2.5, 0, +2.5, +5\}$ and held constant for $\Delta t = 0.2$ s. The dynamics of the system and physical
parameters we used to instantiate the problem are specified in Appendix A.

To formulate this task as a reinforcement learning problem of the form given in Eq. (5), we used the following
settings: the set of initial states is the singleton $X_0 = \{(\pi, 0)\}$, the reward is defined as $g(x(t), u(t)) := 1 -
0.1(\phi(t)^2 - 0.1\dot\phi(t)^2 - 0.1u(t)^2)$, the discount factor is set to $\gamma = 0.99$, and each policy is evaluated for $H = 500$
steps.

Double inverted pendulum The next domain is a more complex variant of the inverted pendulum given
above. This time we have two poles, each mounted on a separate cart, as depicted in Figure 5b. The two carts
are linked by a spring and allowed to move some distance horizontally on the x-axis (until they collide with
a wall, which leads to a failure). Each cart is controlled separately; however, because of the spring their dynamics
are coupled. As in the inverted pendulum, the goal is to swing up and stabilize the poles as quickly as
possible, but now by moving the carts back and forth. This creates a rather challenging nonlinear control problem.
The state space is 8-dimensional, $x = (x_1, x_2, \dot{x}_1, \dot{x}_2, \theta_1, \theta_2, \dot\theta_1, \dot\theta_2)$, with $x_i \in [-1, 1]$ being the position
and $\dot{x}_i \in [-10, 10]$ the velocity of the $i$-th cart, and with $\theta_i \in [0, 2\pi]$ being the angle and $\dot\theta_i \in [-5, 5]$ being
the angular velocity of the $i$-th pole. The two-dimensional control vector is discretized to the four actions
$a \in \{(-2, -2), (-2, +2), (+2, -2), (+2, +2)\}$ and held constant for $\Delta t = 0.1$ s. The dynamics of the system
and physical parameters we used to instantiate the problem are specified in Appendix B.



Figure 5: From left to right: the inverted pendulum task, the double inverted pendulum linked with a spring task, 
and the acrobot handstand task. 

To formulate this task as a reinforcement learning problem of the form given in Eq. (5), we used the following
settings: the set of initial states is the singleton $X_0 = \{(0, 0.5, \pi, \pi, 0, 0, 0, 0)\}$, the reward is defined as
$g(x(t), u(t)) := [(1 + \cos\theta_1(t)) + (1 + \cos\theta_2(t))]/4$, the discount factor is set to $\gamma = 0.999$, and each policy is
evaluated for $H = 250$ steps. In order to achieve a high reward in this domain, the policy has to both balance the
poles and not collide with one of the walls. Low rewards usually occur because of such collisions.

Acrobot Our third domain is the acrobot from fl4rjfl : a two-link robot that resembles a gymnast swinging up 
above a high bar (see Figure 5c). The acrobot freely swings around the first joint (the hands grasping the bar) and
can exert force only at the second joint (bending the hips). The acrobot is an underactuated system; the task we
consider here is the inverted "handstand" position, which is hard to solve using reinforcement learning methods.
The state space is 4-dimensional, $x = (\theta_1, \theta_2, \dot\theta_1, \dot\theta_2)$, with $\theta_1$ and $\theta_2$ being the angles of the upper and lower
link, and $\dot\theta_1$ and $\dot\theta_2$ their angular velocities, respectively. The continuous control is discretized to $\{-1, +1\}$ and
held constant for $\Delta t = 0.2$ s. To facilitate staying stable in the inverted handstand position (a highly unstable
equilibrium), we also include a third non-primitive "balance" action, which chooses control values derived from
an LQR controller obtained from linearizing the system dynamics about the handstand position. Note that this
balance action produces meaningful outputs $\in [-1, +1]$ only very close to the unstable equilibrium and thus
cannot be used to bring the acrobot from the initial state to the goal region. The dynamics of the system and
physical parameters we used to instantiate the problem are specified in Appendix C.

To formulate this task as a reinforcement learning problem of the form given in Eq. (5), we used the following
settings: the set of initial states is the singleton $X_0 = \{(-\pi/2, 0, 0, 0)\}$, the reward is defined by the height of
the end of the second link (the feet) as $g(x(t), u(t)) = 2 + \cos(\theta_1(t) - \pi/2) + \cos(\theta_1(t) + \theta_2(t) - \pi/2)$ (+100
if handstand), the discount factor is set to $\gamma = 1$, and each policy is evaluated for $H = 500$ steps.$^6$

HIV drug treatment Our last problem domain is taken from a real-world application in medical control [1].
The aim is to optimize the treatment of a patient infected by HIV over a period of a few years using what is known
as structured treatment interruption (STI). The treatment of the patient consists of choosing a combination of two
drugs, yielding 4 possible choices (including the case where no drug at all is taken), to administer every 5 days.
Administering the correct cocktail with the correct timing can hinder the spread of HIV-infected cells and will
eventually bring the patient into a healthy state (a locally stable equilibrium); however, the drugs also have side



6 Note that since $\gamma = 1$, the criterion that we optimize for this problem is the (time-independent) finite sum of rewards. Changing the
criterion in this way does not make any technical difference for any of the direct policy search techniques.



effects on the patient's health and thus their use should be kept to a minimum. Finding an optimal treatment
strategy is considered a challenging optimal control problem with highly nonlinear transition dynamics [16]. The
system is represented by a six-dimensional state vector $x = (T_1, T_2, T_1^*, T_2^*, V, E)$, where $T_1 \geq 0$ and $T_2 \geq 0$
are the counts of healthy type-1 and type-2 cells, $T_1^* \geq 0$ and $T_2^* \geq 0$ are the counts of infected type-1 and type-2
cells, $V \geq 0$ is the number of free virus copies, and $E \geq 0$ the number of immune response cells. The two-dimensional
control vector $u = (\varepsilon_1, \varepsilon_2)$ consists of the dosage of two drugs, which is discretized to the four
actions $\{(0.7, 0.3), (0.7, 0), (0, 0.3), (0, 0)\}$ and held constant for $\Delta t = 5$ days. The dynamics of the system and
physical parameters we used to instantiate the problem are specified in Appendix D.

To formulate this task as a reinforcement learning problem of the form given in Eq. (5), we used the following
settings: the set of initial states is the singleton $X_0 = \{(163573, 5, 11945, 46, 63919, 24)\}$, which corresponds
to the unhealthy locally stable equilibrium (i.e., a high number of HIV-infected cells). The reward is composed
of the number of infected and uninfected cells plus an additional term reflecting the cost for using a drug:
$g(x(t), u(t)) = -0.1V(t) + 10000E(t) - 20000\varepsilon_1(t) - 20000\varepsilon_2(t)$. The discount factor is set to $\gamma = 0.98$,
and each policy is evaluated for $H = 300$ steps. Note that in this domain, unlike in all the previous ones, the
reward is not upper-bounded (and thus the optimistic tree development strategy from Eq. (12) cannot be applied).
Moreover, the values of the state variables can vary over a large range, up to the order of $10^6$; to counter
any unwanted scaling effects in our learning methods we transformed the state variables by taking their logarithm.

5.2 Contestant methods 

We compare OLT policies against both pure DPS approaches and pure LT approaches. From the point of view of
DPS, we consider three alternative policy representations (neural networks, adaptive radial basis functions and
optimized look-ahead trees) and two alternative optimizers (cross-entropy and GPO). Our pure LT approaches
are composed of the uniform tree development strategy, the optimistic development strategy proposed in [23] and
two greedy tree development strategies.

DPS Representations. We consider the following two common policy representations: 

• Neural network. This representation is probably the most widely used in the direct policy search literature 
fill . The policy is represented by a fully connected feed-forward neural network, which has one hidden 
layer and one output layer with one neuron per possible action. Given the current state, the neural network 
computes one activation score per action and the policy returns an action with maximal score. Hidden 
nodes have tanh activations and their number is a hyper-parameter that controls the complexity
of the policy; output nodes have a linear activation function.

• Adaptive radial basis functions. In this representation, proposed in [6], a policy is encoded through a set of
basis functions that are attached to particular actions. Given the current state, the policy works by searching
for the nearest basis function ("nearest" being measured under the Mahalanobis distance metric) and by
returning the action attached to this basis function. Each single basis function is parameterized by the
location of its center, the length-scales along each dimension and the recommended action. The number of
basis functions used to encode the policy is a complexity hyper-parameter that has to be tuned.
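
To make the second representation concrete, here is a minimal sketch of a nearest-basis-function policy. It assumes a diagonal Mahalanobis metric, i.e., each coordinate difference is divided by the corresponding length-scale before squaring; the type alias `BasisFunction` and the function names are introduced for this example only and are not taken from the cited work.

```python
from typing import Sequence, Tuple

# (center, per-dimension length-scales, index of the attached action)
BasisFunction = Tuple[Sequence[float], Sequence[float], int]


def mahalanobis_sq(x: Sequence[float], center: Sequence[float],
                   scales: Sequence[float]) -> float:
    """Squared distance with a diagonal metric defined by the length-scales."""
    return sum(((xi - ci) / si) ** 2 for xi, ci, si in zip(x, center, scales))


def rbf_policy(x: Sequence[float], basis: Sequence[BasisFunction]) -> int:
    """Return the action attached to the basis function nearest to state x."""
    _, _, action = min(basis, key=lambda bf: mahalanobis_sq(x, bf[0], bf[1]))
    return action
```

The length-scales matter: a basis function with a small length-scale along some dimension only "wins" for states very close to its center along that dimension, even if it is the nearest one in the Euclidean sense.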

Table 1 summarizes the characteristics of the three policy representations that we consider in this paper. Each
representation has a hyper-parameter that controls the policy complexity. Note that optimized look-ahead
trees are unique in the sense that the number of their parameters, and hence the complexity of the associated
direct policy search problem, does not depend on the value of this hyper-parameter.
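
The parameter counts listed in Table 1 can be turned into a small worked example. The formulas below are transcribed from the table (the RBF entry is read as $2 \times n_x \times \text{nBF}$, i.e., one center and one length-scale per state dimension and basis function, which is an interpretation of a partly illegible entry); the function names and the example sizes are illustrative.

```python
def nn_param_count(n_x: int, n_hidden: int, n_actions: int) -> int:
    """One hidden layer: (n_x + 1) * nHidden + (nHidden + 1) * K."""
    return (n_x + 1) * n_hidden + (n_hidden + 1) * n_actions


def rbf_param_count(n_x: int, n_bf: int) -> int:
    """Continuous parameters only: center and length-scales per basis function."""
    return 2 * n_x * n_bf


def olt_param_count(n_x: int) -> int:
    """Independent of the budget of node expansions."""
    return 3 * n_x


# Illustrative sizes for a 6-dimensional state space with 4 actions:
print(nn_param_count(6, 20, 4))   # 224
print(rbf_param_count(6, 15))     # 180
print(olt_param_count(6))         # 18
```

Doubling the number of hidden neurons or basis functions roughly doubles the dimension of the search space, whereas increasing the OLT budget leaves it unchanged.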

DPS Optimizers. Cross-entropy (CE) [44] is a versatile global optimization technique that is widely used
in direct policy search [4]. In particular, this method was used in previous work [32] under the alternative
name of estimation of distribution algorithm. The chief advantage of cross-entropy (and the main reason for its
popularity) is that it is very easy to implement and, because it is population-based, can deal with a
potentially very large number of sample locations (i.e., CE operates only on small batches of sample locations
the size of which is constant throughout, whereas GPO has to consider all current sample locations to determine
the next one). The algorithm works by fitting a distribution to the best currently found solutions and by using this



Name | Policy representation | Hyper-parameter | Number of parameters
Neural network (NN) | Feed-forward neural network with tanh activation and one hidden layer | Number of hidden neurons (nHidden) | $(n_x + 1) \times \text{nHidden} + (\text{nHidden} + 1) \times K$
Radial basis functions (RBF) | Adaptive radial basis functions with Mahalanobis distance as in [6] | Number of basis functions (nBF) | $2 \times n_x \times \text{nBF}$
Optimized look-ahead trees (OLT) | Look-ahead tree (algo 2) with optimized node scoring heuristic $e(\cdot\,;\theta)$ as in Eq. (13) | Budget of node expansions ($h_{\max}$) | $3 \times n_x$

Table 1: Summary of the policy representations compared in this paper. $n_x$ denotes the dimensionality of the
state space and $K$ is the number of actions.



distribution to sample new candidate solutions. In our case, we use a simple variant of cross-entropy that relies
on a multivariate Gaussian distribution with diagonal covariance matrix. This distribution is first initialized to
cover the whole search space: we start with a Gaussian with zero mean and a diagonal covariance matrix, the
entries of which are equal to the square of the half-length of an interval centered at zero and containing the
corresponding coordinate of the search space. The algorithm then draws a number $N_{CE}$ of observations from
this distribution, where each observation is a parameter vector and represents a possible policy, evaluates the
$N_{CE}$ resulting policies and sorts them according to their performance. It then picks a number $M_{CE}$ of the
best performing policies, and uses them to update the generating distribution: the old mean is replaced by the
sample mean and each diagonal entry of the covariance matrix is replaced by the per-coordinate variance of the
$M_{CE}$ best parameter vectors. These two steps are iterated until a stopping condition, in our case a predefined
maximum number of iterations, is reached.
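
The cross-entropy variant just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the experiments: the argument names (`half_lengths`, `n_samples` for $N_{CE}$, `n_elite` for $M_{CE}$) and default values are chosen for this example, the per-coordinate variance is computed as the population variance of the elite set, and the best sample seen so far is tracked in addition to the final mean.

```python
import random
from statistics import mean, pvariance
from typing import Callable, List, Sequence, Tuple


def cross_entropy_maximize(
    score: Callable[[Sequence[float]], float],
    half_lengths: Sequence[float],   # half-length of each zero-centered interval
    n_samples: int = 50,             # N_CE
    n_elite: int = 10,               # M_CE
    n_iter: int = 30,
    seed: int = 0,
) -> Tuple[List[float], float, List[float]]:
    """Return (best parameter vector, its score, final distribution mean)."""
    rng = random.Random(seed)
    # Initial Gaussian: zero mean, variance = squared half-length.
    mu = [0.0] * len(half_lengths)
    sigma = [float(h) for h in half_lengths]
    best_theta: List[float] = list(mu)
    best_score = float("-inf")

    for _ in range(n_iter):
        # Draw N_CE parameter vectors and evaluate the induced policies.
        population = [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
                      for _ in range(n_samples)]
        scored = sorted(((score(th), th) for th in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_theta = scored[0]
        # Refit the diagonal Gaussian to the M_CE best vectors.
        elite = [th for _, th in scored[:n_elite]]
        mu = [mean(col) for col in zip(*elite)]
        sigma = [pvariance(col) ** 0.5 for col in zip(*elite)]

    return best_theta, best_score, mu
```

Each iteration costs exactly `n_samples` policy evaluations, so the total optimization budget is `n_samples * n_iter`, independently of how many samples have been seen before.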

LT strategies. Classical look-ahead tree policies rely on fixed generic node scoring functions. In order to
demonstrate the significance of learning the node selection rule in a problem-driven way, we compare our approach
to look-ahead tree policies using $e^{\text{uniform}}$ (uniform tree development, Eq. (11)), $e^{\text{optimistic}}$ (the method
proposed by [23], Eq. (12)) and two forms of greedy tree development:

^reedy-l^ ._ ^ ^reedy-2 ^ . = ^ ^ (2fi) 

5.3 Performance comparison 

We start by comparing the performance we obtain with the different approaches discussed previously. Note
that, contrary to OLT, NN and RBF typically involve solving challenging global optimization problems with
hundreds or thousands of parameters. While GPO is quite efficient for problems that have a reasonable number
of parameters, this approach requires sophisticated approximations to scale to higher-dimensional problems (e.g.,
see [38]). In this paper we use a naive textbook implementation of GPO that is able to solve OLT optimization
problems, but that suffers from scaling problems when the number of samples increases, which happens to be 
problematic in high-dimensional problems. Therefore, we use CE to learn NN and RBF based policies. To make 
the comparison fair, we also use CE to optimize the parameters of OLT policies in this first part of our empirical 
study. A comparison between CE and GPO for learning OLT policies is provided in Section [53] 

Experimental protocol. We use the same test procedure for all policies and the performance is measured as
the discounted sum of rewards obtained when executing the policy for $H$ steps (see Section 5.1), starting from
the initial state$^7$ $X_0$.
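
The performance measure used in this protocol, the discounted sum of rewards over $H$ steps from a given initial state, can be written as a short rollout routine. The sketch below is generic: `policy`, `transition` and `reward` are placeholders for the components of a benchmark domain and are not tied to any specific one.

```python
from typing import Callable, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type


def evaluate_policy(
    policy: Callable[[S], A],
    transition: Callable[[S, A], S],
    reward: Callable[[S, A], float],
    x0: S,
    gamma: float,
    horizon: int,
) -> float:
    """Discounted return sum_{t=0}^{H-1} gamma^t g(x_t, u_t) starting from x0."""
    x, total, weight = x0, 0.0, 1.0
    for _ in range(horizon):
        u = policy(x)
        total += weight * reward(x, u)
        x = transition(x, u)
        weight *= gamma
    return total
```

With a constant reward of 1, a discount factor of 0.99 and 500 steps, this returns $(1 - 0.99^{500})/(1 - 0.99)$, slightly below 100; with a discount factor of 1 it reduces to the plain sum of rewards.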

The parameters of CE were tuned by hand so as to give in each case a good result with a reasonable amount
of computation. The result of this tuning is given in Table 2. Note that, since OLT involves lower-dimensional

7 Recall that for each of our four benchmark domains the set $X_0$ only contains one single initial state $x_0$. Yet as we will see below,
robustness with respect to perturbations of the initial state is one of the strengths of look-ahead tree policies, so that optimizing only over
one initial state is justified.
Table 2: Cross-entropy parameters for the four problems and the three policy representations.

Representation | Quantity | Inverted pendulum | Double inverted pendulum | Acrobot | HIV
NN | nHidden | 50 | 50 | 20 |
NN | Performance | 89.9 | 134.9 | 3.89e4 |
RBF | nBF | 30 | | 30 | 15
RBF | Performance | 73.4 | | 4.09e4 | 1.39e9
OLT | Budget | 31 | 1365 | 40 | 85
OLT | Performance | 93.2 | 145.2 | 4.07e4 | 4.22e9

Table 3: Quality of learned policies with neural networks, radial basis functions and optimized look-ahead trees. 
For each problem, the best performance is shown in bold. For each method and each problem we also display the 
value of the tuned hyper-parameter. 

optimization problems than the two other representations, the required optimization budget ($N_{CE} \times n_{iter}$) is
lower on all problems for this representation. For the RBF representation, CE has to optimize both continuous
parameters (RBF centers and length-scales) and discrete parameters (action assignments). To optimize these
discrete parameters, we chose a multinomial distribution with Dirichlet prior where the initial count of each
action was set to 10.

The hyper-parameters of the three policy representations were tuned by grid-search. For both the number of
hidden nodes in the NN representation and the number of basis functions in the RBF representation, we tested
the following values: {1, 2, 3, 4, 5, 7, 10, 12, 15, 20, 30, 40, 50}. As in [23, 32], we tested budget values for look-ahead
tree policies that correspond to the number of nodes of fully developed trees of varying depth $d = 1, \ldots, 8$.
These budget values are as follows: $h_{\max} \in \{1,\; 1 + K,\; 1 + K + K^2,\; 1 + K + K^2 + K^3,\; \ldots\}$.
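
As a small worked example of this budget grid, the following helper (its name is chosen for this illustration) lists the partial sums $1, 1 + K, 1 + K + K^2, \ldots$ for a given number of actions $K$.

```python
from typing import List


def full_tree_budgets(n_actions: int, count: int) -> List[int]:
    """First `count` values of 1, 1 + K, 1 + K + K^2, ... for K = n_actions."""
    budgets, total, level_size = [], 0, 1
    for _ in range(count):
        total += level_size        # add the nodes of the next tree level
        budgets.append(total)
        level_size *= n_actions
    return budgets


print(full_tree_budgets(4, 9))  # [1, 5, 21, 85, 341, 1365, 5461, 21845, 87381]
print(full_tree_budgets(5, 3))  # [1, 6, 31]
print(full_tree_budgets(3, 4))  # [1, 4, 13, 40]
```

With $K = 4$ this grid contains the budgets 85, 341, 1365 and 87381 that appear in the experiments, with $K = 5$ it contains 31, and with $K = 3$ it contains 40.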

Performance comparison. Table 3 reports the best scores obtained by the three kinds of policies as well as
the values of the tuned hyper-parameters. We observe that OLT works significantly better than the other DPS
methods on three out of the four benchmark domains. On the HIV domain, OLT enables us to obtain slightly
better results than the state of the art [16] (4.22e9 against 4.16e9), whereas the two other representations
only manage to reach policies with very moderate scores ($\approx$ 1e9). RBF slightly outperforms OLT on the acrobot
domain. However, as we will see later, this result holds thanks to a very careful tuning of the number of radial
basis functions, whereas OLT works well for a wide range of budget values.

Relevance of learning the node scoring function. We have seen that the OLT policy representation makes it possible to
reach policies outperforming those obtained with neural networks and radial basis functions. One could wonder
whether this result could be obtained by using look-ahead trees without learning. We therefore performed a series
of experiments that, for various budget values, compare the performance of look-ahead tree policies with and
without learning. Note that in these experiments, the budget is set before optimization, hence we optimize one
OLT policy per tested budget value.

The results of our comparison between OLT policies and traditional LT policies are given in Figure 6. The best
scores are obtained with OLT policies on three domains out of the four (inverted pendulum, acrobot and HIV).
Furthermore, thanks to learning, significantly lower budgets are required to reach a given level of performance,
again on three domains out of the four (double inverted pendulum, acrobot and HIV). The most impressive results
are obtained on the HIV domain, for which an OLT policy with a budget of 2 node expansions already performs
better than all other LT policies with a budget up to $h_{\max} = 87381$ node expansions (which corresponds to fully
Figure 6: Performance of optimized look-ahead tree policies (OLT) vs. baseline look-ahead tree policies for
various budget values. Since $\gamma = 1$ in the acrobot domain, the greedy-2 policy degenerates to greedy-1 and
the u-score based policy degenerates to the uniform policy. Since the reward is not upper-bounded in the HIV
domain, the u-score is not defined for this problem. As explained in the text, because the time required to make a
single decision increases linearly in the number of node expansions and because we have to evaluate many more
policies to produce a single OLT curve than we have to produce an LT curve, we did not evaluate OLT on the same
large budget values as LT.




Figure 7: Trajectory of the HIV system when controlled by an OLT policy with a budget of 85 node expansions. 
The first six panels show the development of the six system states, the following two panels show the dosage of 
the RTI and PI drugs applied, and the last panel shows the reward obtained in each step. 



developed trees of depth 8). An OLT policy with such a small budget first expands the node corresponding
to the current state and then expands one of its successor nodes. Our results show that carefully selecting this
successor state is a much more efficient way to solve the HIV problem than developing large trees in a generic way. In
the same spirit, on the acrobot problem, an OLT policy with only 8 node expansions performs slightly better than
a uniformly developed tree with a budget of 9841 nodes (corresponding to a tree of depth 7).
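
The HIV budgets quoted above can be cross-checked with a little arithmetic. The sketch below assumes K = 4 discrete actions in the HIV domain (an assumption, since K is not restated in this section) and counts the nodes at depths 0 through d of a fully developed tree. The helper name `full_tree_budget` is hypothetical and only serves this illustration.

```python
def full_tree_budget(num_actions: int, depth: int) -> int:
    """Number of nodes at depths 0..depth of a fully developed tree."""
    return sum(num_actions ** d for d in range(depth + 1))


# Assumption: K = 4 actions (HIV domain).
print(full_tree_budget(4, 8))  # 87381, the largest LT budget quoted for HIV
print(full_tree_budget(4, 3))  # 85, the OLT budget used for the HIV trajectory
```

Under the same assumption, a budget of 2 covers only the root and a single, carefully chosen successor.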

On the double inverted pendulum problem, OLTs are ultimately outperformed by look-ahead tree policies
using the u-score. However, we observe that our approach is much better able to deal with a constrained computational
budget. For example, using only 341 node expansions we obtain a reasonably well-performing policy that
outperforms all other generic tree development strategies even when they have very large budgets.

On the inverted pendulum problem, we see that all LT policies achieve a near-optimal performance (which 
we can compute for this domain, e.g. see E51 ) already with a budget of 5. It thus seems that there is little interest 
in learning a specific node scoring function for this problem. 

Illustration of the HIV policy. In order to allow a direct comparison between the performance of our method
on the HIV domain and the performance reported in earlier related work Ifl6l 171 l6l, we plot in Figure 7 the
trajectory that we obtain on the HIV system. These results show that our policy applies RTI and PI drugs in a way
which is very similar to what other state-of-the-art policies do (which, however, are obtained in a fundamentally
different manner, e.g., by fitted Q-value iteration using millions of sample transitions).

5.4 Sample efficiency 

We have seen that the OLT representation makes it possible to reach high-performance policies, which often outperform
alternative DPS representations. Besides performance, another aspect of major importance in DPS is the sample
efficiency of the learning process, i.e. how fast good policies can be obtained. We study sample efficiency by
looking at the performance of our different methods as a function of two metrics: the number of policy evaluations
performed and the number of transitions simulated.

Experimental protocol. The most common solution to measure sample efficiency in a DPS scheme is to look 
at the number of policy evaluations required to reach a certain level of performance. This measure corresponds to 
the number of different parameter values that have been tried by the optimization algorithm. Here, we consider 
two different optimization algorithms: cross-entropy and GPO. Cross-entropy is tuned as previously and GPO is 

Figure 8: Performance vs. number of policy evaluations. From left to right: OLT with GPO, OLT with CE and
RBF with CE. From top to bottom: inverted pendulum, double inverted pendulum, acrobot handstand and HIV
drug treatment.

instantiated as follows: as kernel we chose, as is standard for GP regression, the squared exponential given in
Eq. (18), where the hyperparameters are found for each batch of data via marginal likelihood optimization [42].
In this optimization, the best setting of the hyperparameters found in the previous iteration is used as the mean
of the hyperprior in the next iteration. To generate an initial batch of training data, we generated 10 samples (100
samples in the double inverted pendulum domain) via Latin hypercube sampling; these initial samples are taken
into account when we compare sample efficiency. To find at each iteration of GPO the best next sample location,
we optimize the EI acquisition function from Eq. (25), with its parameter set to 0.01, using DIRECT.
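
To make the two GPO ingredients named above more concrete, the following sketch shows one common form of the squared exponential kernel and of expected improvement (EI) for a maximization problem. The exact equations of the paper are not reproduced in this section, so the formulas, the function names and the parameter name `xi` are assumptions made for illustration, not the authors' implementation.

```python
import math


def squared_exponential(x, y, length_scales, signal_var=1.0):
    """Squared exponential kernel with one length scale per dimension."""
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, length_scales))
    return signal_var * math.exp(-0.5 * sq)


def expected_improvement(mu, sigma, best, xi=0.01):
    """EI at a point with GP posterior mean `mu` and standard deviation `sigma`.

    `best` is the best return observed so far; `xi` is a small margin
    (an assumed name for the 0.01 parameter mentioned in the text).
    """
    improvement = mu - best - xi
    if sigma <= 0.0:
        return max(improvement, 0.0)  # no posterior uncertainty left
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf
```

In a GPO loop, a global optimizer such as DIRECT would maximize `expected_improvement` over candidate parameter vectors, using the posterior mean and standard deviation returned by the Gaussian process.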

We are interested in two questions: how does GPO compare against CE when using the OLT representation,
and how does the choice of representation impact the sample efficiency? To answer these questions, we selected
the following three setups: OLT with GPO, OLT with CE and RBF with CE, and plotted the corresponding
learning curves for different values of the $h_{\max}$ and nBF hyper-parameters. These plots are given in Figure 8.

Figure 9: Performance vs. number of simulated transitions.



Impact of the optimizer. We observe that GPO requires about one order of magnitude fewer policy evaluations 
than CE on all four problems (note that the GPO curves stop at 500 policy evaluations, whereas we display CE 
results until 5000 evaluations). We observe no notable difference between the performance of the final policies 
obtained with GPO and CE, and the overall shapes of the learning curves are similar. Note that CE is much easier 
to implement than GPO. Therefore, in the light of these results, we suggest using CE when sample complexity is
not a major concern and selecting GPO when the number of policy evaluations is a strong practical
constraint. Using GPO requires a bit more work, but pays off since learning then requires roughly
one order of magnitude fewer policy evaluations.

Impact of the representation. Our results show that optimizing look-ahead trees is substantially more sample
efficient than optimizing the parameters in a basis function representation, this difference being 5K versus
50K-100K policy evaluations when using the cross-entropy method. The same kinds of differences are observed when
comparing against the NN representation. One obvious explanation for this is that optimizing the node scoring
function involves fewer parameters, which results in the search space having lower dimensionality. As mentioned
previously, a key advantage of OLT is that the complexity of the policy (i.e. the computational budget $h_{\max}$) can
be arbitrarily scaled without increasing the number of parameters to be optimized. In contrast, for both the neural
network and adaptive basis function representations, higher complexity can only be achieved at the expense of a
larger number of policy parameters, which generally involves harder optimization problems.

Performance vs. number of simulated transitions. An important characteristic of OLT policies is that they
use some online computational resources to take their decisions. Specifically, in order to take a single decision,
a look-ahead tree policy requires simulating $K h_{\max}$ transitions using a generative model of the problem, since
computing the policy involves expanding $h_{\max}$ nodes and since expanding a node invokes the model once per
possible action. Hence, it is also important to study the performance of the algorithm with respect to the number
of simulated transitions. This new sample complexity measure is computed in the following way. Note that
one policy evaluation requires making a trajectory of H steps (assuming we have a unique training initial state)
and that the simulator is called once per step of this trajectory. The number of simulated transitions per policy
evaluation is thus H for NN and RBF policies. If we add the additional simulation cost of OLT policies, the




Figure 10: Performance vs. hyper-parameter value. Top: Performance of NN as a function of the number of hidden
neurons. Bottom: Performance of RBF as a function of the number of basis functions.

number of simulated transitions per policy evaluation becomes $H \times (1 + K h_{\max})$.
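
The cost model just described is easy to evaluate numerically. The sketch below is only an illustration: the function name and the example values (H = 100, K = 4, a budget of 85) are hypothetical and are not the settings used in the experiments.

```python
def transitions_per_evaluation(horizon: int, num_actions: int = 0, budget: int = 0) -> int:
    """Simulated transitions for one policy evaluation: H * (1 + K * h_max).

    With budget = 0 (no look-ahead tree, as for NN and RBF policies)
    this reduces to H.
    """
    return horizon * (1 + num_actions * budget)


# Hypothetical values: H = 100 steps, K = 4 actions, h_max = 85 expansions.
nn_cost = transitions_per_evaluation(100)          # 100
olt_cost = transitions_per_evaluation(100, 4, 85)  # 100 * (1 + 340) = 34100
```

With these illustrative numbers, a single OLT evaluation costs as many simulated transitions as 341 evaluations of a policy without a tree, which is why the low-budget setting matters on this metric.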

We compare our three policy representations with the tuned hyper-parameters given in Table 3. Since OLT
policies with high budget values are strongly disadvantaged w.r.t. the number of simulated transitions, we consider
an additional low-budget OLT setting, where the budget values were chosen by hand to be as small as possible
while still producing reasonably good policies asymptotically. Figure 9 reports the learning curves obtained by
CE for the NN and RBF settings and by both GPO and CE for the two OLT settings.

We observe slightly different behaviors depending on the problem. In the first two problems, the best policies
are obtained with NN and RBF for small numbers of simulated transitions, and after a given threshold, OLT
policies become better. This was expected since evaluating a single OLT policy can already require simulating
thousands of transitions. In the acrobot and HIV problems, OLT policies outperform the other ones on almost
the whole x-axis range, which was more unexpected. This suggests that the larger number of transitions required
to take decisions is partly compensated by the smaller number of parameters to optimize.

On all four problems, we observe that (i) low-budget OLT policies outperform tuned OLT policies when the 
number of simulated transitions is small, (ii) policies optimized with GPO always outperform their counterparts 
optimized with CE, and (iii) the asymptotically best (or almost best in the case of acrobot) policies are based on 
optimized look-ahead trees. 

5.5 Robustness with respect to hyper-parameters 

In addition to performance and learning efficiency, another important property of DPS approaches from a practical
perspective is that their hyper-parameters should be easy to tune. Indeed, tuning these parameters is part
of the whole policy learning process, hence the more trial and error is required, the longer it takes to obtain
high-performance policies. The behavior of OLT policies as a function of the $h_{\max}$ hyper-parameter was shown
in Figure 6. We report the performance of the other two kinds of policies as a function of their respective hyper-parameters
in Figure 10. Since CE is a stochastic optimization algorithm, we performed these experiments ten
times for each setting and report the empirical means and standard deviations.

Depending on the problem, NN and RBF are more or less sensitive to the setting of their complexity parameter.
For both methods, there are some problems for which tuning the complexity hyper-parameter is rather
easy, and others for which it is much harder. We observe that on the acrobot benchmark, which is the only one
where NN and RBF are able to produce policies competitive with OLT policies, the performance of both NN and
RBF policies has a large variance: performance varies strongly both when increasing the complexity (number of
nodes/basis functions) and across different runs in the same setting. To understand this large variance, it should be
noted that acrobot is different from the previous domains in that there is no partial solution: a policy is either able
to swing up and successfully balance in the handstand position indefinitely (the reward is $> 10^4$) or not at all (the




Figure 11: Regret for NN, RBF and OLT policies on the inverted pendulum domain when the testing initial state
differs from the training initial state.



reward is of order $10^3$ at most). Successful balance is only possible if the system approaches the tiny region in
the state space from which LQR can take over.

Although we observe in some cases that well-performing policies can be found using a fairly compact representation
(e.g., in the double inverted pendulum domain, a single hidden neuron can already give good performance),
our results indicate that, seen across all domains, finding the best possible complexity parameter for NN
and RBF requires a larger amount of trial-and-error experimentation. In contrast, as shown in Figure 6, OLTs
are rather easy to tune, since, in general, increasing the budget increases the quality of the policy or only slightly
degrades it. In all our experiments, a default value corresponding to a fully developed tree of depth 3 or 4 yielded
well-performing policies.



5.6 Robustness with respect to initial states

In DPS, policies are optimized w.r.t. a particular distribution over initial states. An important issue is that this 
distribution may not perfectly match future unexpected usage of the learned policy. In general, a preference is 
given to policies that are robust w.r.t. such mismatches. To compare the robustness of the NN, RBF and OLT 
policies and study how well they are able to generalize when starting from initial states which differ from those 
used during learning, we performed experiments by evaluating them on perturbed or completely different initial 
states on the inverted pendulum problem and the HIV drug treatment problem. 



Stability on inverted pendulum. To analyze the stability of the various policies on this domain, we started by
learning NN, RBF and OLT policies with the single initial state $x_0 = (-\pi, 0)$ and then evaluated these policies
across the whole domain by systematically varying the initial state. Recall that we are able to compute a
truly optimal policy on the inverted pendulum domain. We use this latter policy to report the regret for each initial
state, i.e. the difference between the optimal performance and the performance obtained by the learned policy. The
results of these experiments are given in Figure 11. It can be seen that an OLT policy (budget 20) performs well
and incurs close to zero regret across large parts of the state space, with an exception being the region around
$(\pi, 0)$. When comparing with NN and RBF policies, we see that both representations are less able to generalize
and can incur comparatively large regrets even for small perturbations.
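
The regret computation described above amounts to evaluating two return functions on a grid of initial states and subtracting them. The sketch below illustrates this under stated assumptions: the grid resolution, the helper names and the two toy return functions are hypothetical stand-ins, whereas in the actual protocol the returns would come from simulating the optimal and the learned policy.

```python
import math
from typing import Callable, Dict, List, Tuple

State = Tuple[float, float]  # (angle, angular velocity)


def grid(n_angle: int, n_velocity: int) -> List[State]:
    """Regular grid over angle in [-pi, pi] and velocity in [-10, 10]."""
    angles = [-math.pi + 2 * math.pi * i / (n_angle - 1) for i in range(n_angle)]
    velocities = [-10.0 + 20.0 * j / (n_velocity - 1) for j in range(n_velocity)]
    return [(a, v) for a in angles for v in velocities]


def regret_map(states: List[State],
               optimal_return: Callable[[State], float],
               policy_return: Callable[[State], float]) -> Dict[State, float]:
    """Regret per initial state: optimal return minus learned-policy return."""
    return {x0: optimal_return(x0) - policy_return(x0) for x0 in states}


# Toy stand-ins for the two evaluation routines (purely illustrative).
toy_optimal = lambda x0: 1.0
toy_learned = lambda x0: 1.0 - 0.1 * abs(x0[1])
regrets = regret_map(grid(3, 3), toy_optimal, toy_learned)
```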



Stability on HIV drug treatment. Figure 12 illustrates the robustness of an OLT policy (budget 85) in the
HIV domain. This time we are no longer able to exhaustively sample the state space; instead we perturb each
state variable in turn by multiplying each variable of $x_0$ independently by a factor ranging from 0.1 to 10. The
figure then shows that the performance stays largely stable even for large perturbations (see footnote 8). In particular, in all cases
the performance of the OLT policy remains significantly higher than the best results obtained by NN and RBF
(> 3e9 vs. ≈ 1e9).

8 Note that in doing this we did not pay attention to the physical meaning of the variables; thus by increasing the value of the sixth 
state variable it appears as if the policy begins to perform even better. However, since this state variable denotes the count of healthy cells 
in the blood of a patient, increasing its value by an order of magnitude alters the whole problem. 



Figure 12: Robustness of an OLT policy for the HIV drug treatment domain. The plot examines how the
performance of the policy (with budget 85) decreases when the initial state for which the policy was optimized
is perturbed: each curve corresponds to the perturbation of one state variable independently. The vertical line
at $x = 10^0$ denotes the unperturbed value.

6 Related work 

The approach proposed in this paper lies at the intersection of two families of solutions for sequential
decision-making. We first overview direct policy search in Section 6.1 and then discuss relevant work on look-ahead
tree search in Section 6.2. Section 6.3 positions our approach with respect to Model Predictive Control techniques.
Finally, Section 6.4 suggests that optimized look-ahead tree policies belong to a larger emergent class of
techniques: parameterized algorithms for decision-making.

6.1 Direct policy search 

Direct policy search is a widely used class of solutions that comes from reinforcement learning. For a general 
overview over the field of reinforcement learning, refer to one of the books of ll48l l2l I4T1 l4ll. Over the years, 
many DPS techniques have been proposed and giving credit to every one of them would be hardly possible. 
Individual techniques differentiate themselves by their policy parametrization and the optimization method used 
for identifying parameters leading to a high-performing policy. The main distinction is whether the policy search 
is gradient-free or gradient-based. 

Gradient-based DPS, which very often also goes by the name of policy-gradient method, follows an iterative 
optimization scheme and uses the gradient of the value function to adapt and change the policy parameters such 
that performance increases. This requires the value function to be a smooth function of the policy parameters, 
which is typically achieved by considering compatible stochastic policies. Often however, the gradient cannot be 
computed analytically since the value function itself is not available in closed form. Instead, the gradient has to 
be estimated using, e.g., finite difference approximation, Monte-Carlo rollouts or value function approximation 
which leads to the actor-critic methods. One of the best known examples is probably the natural policy gradient 
method described in HD1 : another more recent example is the work in |[T4ll . While impressive results have been 
obtained in particular for learning controllers in robotics, policy-gradient methods do have some weaknesses; the 
most notable ones being a high sensitivity to the initial value of the policy parameters (the starting point of the 
iteration), abundance of local minima, and difficulties when the return is noisy or stochastic. Note that in this 
paper we do not deal with gradient-based policy search. 

Gradient-free DPS, on the other hand, finds the best policy parameters via derivative-free global optimization.
Its main strength is simplicity and generality; the latter meaning that, because it is derivative-free, the representation
of a policy by any algorithm with adjustable parameters is admissible, and not only those for which
the value function will be differentiable. Since gradient-free DPS techniques perform a global search in the
parameter space, they do not suffer from local minima or from the problem of having to guess a good initial
solution; these techniques were also shown to cope well with stochastic returns and hidden states. For certain
domains in reinforcement learning such as Tetris, the best performing policies known today have been obtained
by gradient-free DPS (see [49] and follow-up work). The weakness of this approach is that conceptually it is
less sample-efficient than policy-gradient methods and thus will require a substantially higher number of policy
evaluations. This comes as a consequence of having to solve a global optimization problem over a potentially
high-dimensional search space; however, by carefully designing the policy parametrization and choosing powerful
optimizers, sample efficiency can often be improved.

Early examples of DPS came under the guise of evolutionary approaches to reinforcement learning, or neuroevolution,
and consisted of neural networks as policy representation, with the weights making up the policy
parameters, and variants of genetic algorithms acting as global optimizer. Examples can be found in 071 and
|[T8l, where later work also considered optimization of the network structure B71, or using recurrent neural networks
to better cope with hidden states |fl9l. A recent comparison of these methods can also be found in iTSTTl and
1261.

As an alternative to genetic algorithms, more recent work started to explore the use of the cross-entropy
method H4l, or variants such as the covariance matrix adaptation evolution strategy CMA-ES GUI, as global
optimizer for policy search. Examples include (22), where the policy was represented by a simple linearly
parametrized function, IT261, where the policy was represented by a neural network, or O, where the policy was
represented by adaptive radial basis functions. These latter two were also used in this paper as baseline methods
during the experimental evaluation of our look-ahead trees. Another example of a policy parametrization is to
use domain-specific building blocks, such as motor primitives, as was done in E71 and [29] to optimize the
gait of the AIBO quadrupedal robot. A different kind of policy representation is used in [50] to learn a policy for
the game of Ms. Pac-Man; here the policy is represented by a list of domain-specific parameterized rules.

Section 4 described a third option for the global optimization part: Gaussian process optimization Il35l l3l.
GPO can achieve very good sample efficiency and was previously considered for policy search in [29] and ifPTl.
In the end however, the choice of which global optimizer one uses will always be secondary; what is more 
important is how the policy is represented and to what extent this representation facilitates the optimization 
process by shaping the "fitness landscape". 

6.2 Look-ahead tree search 

The algorithm studied in this paper can also be related to the larger field of tree-based planning and search. One of 
the most seminal works in this field is the A* algorithm ETl which uses a best-first search to find the shortest path 
from a source state/configuration to a goal state/configuration. The conceptual difference between A* and related 
methods on the one side and what we are presenting here on the other side is that our method implements an online 
and finite computational budget mechanism: a search tree is grown at each decision-making step, and only a finite 
number of node expansions are allowed during tree development (the larger the computational budget, the better 
the decision). In A* the search tree needs to be grown until a goal state is reached. Our method is thus capable of 
online planning and producing closed-loop policies, whereas A* is not. More specifically, the algorithm studied 
in this paper can also be interpreted as a method for learning an exploration strategy in a tree. In A*, the function 
used for evaluating the nodes is the sum of two terms: the length of the so far shortest path from the source to the 
current node and an optimistic estimate of the shortest path from this node to the goal (a so-called "admissible" 
heuristic). Several authors have sought to learn good admissible heuristics for the A* algorithm. For example, we 
can mention the LRTA* algorithm E8l which is a variant of A* and which learns over multiple trials an optimal 
admissible heuristic. More recent work on learning strategies to efficiently explore graphs has focused on the
use of supervised regression techniques using various approximation structures to solve this problem (e.g., linear
regression, neural networks, k-nearest neighbors); for example, see ll30ll34ll52ll2Tl. The question of which node
to best explore next also arises in the context of game-playing; here we can mention, e.g., Monte-Carlo tree search
lTT2l. In particular, progressive strategies which widen the set of actions (nodes considered for expansion), such as in
P31 [121 [9J, and which can also be used for continuous action spaces B3l, could be a promising enhancement for
the policy search technique presented herein.
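
For readers less familiar with A*, the node evaluation described earlier in this section (path length so far plus an admissible estimate of the remaining distance) can be sketched as follows. This is a generic textbook-style illustration on a hypothetical 3x3 grid with one blocked cell, not code from any of the cited works.

```python
import heapq


def a_star(start, goal, neighbors, heuristic):
    """Return the length of a shortest path from start to goal, or None."""
    g = {start: 0}                             # best known cost from the source
    frontier = [(heuristic(start), start)]     # ordered by f = g + h
    while frontier:
        f, node = heapq.heappop(frontier)
        if node == goal:
            return g[node]
        if f > g[node] + heuristic(node):
            continue                           # stale entry, a shorter path was found
        for nxt, cost in neighbors(node):
            new_g = g[node] + cost
            if new_g < g.get(nxt, float("inf")):
                g[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt), nxt))
    return None


# Hypothetical example: 3x3 grid, unit step costs, cell (1, 1) blocked.
WALLS = {(1, 1)}
GOAL = (2, 2)


def grid_neighbors(node):
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] <= 2 and 0 <= nxt[1] <= 2 and nxt not in WALLS:
            yield nxt, 1


def manhattan(node):
    """Admissible heuristic: never overestimates the remaining path length."""
    return abs(node[0] - GOAL[0]) + abs(node[1] - GOAL[1])


shortest = a_star((0, 0), GOAL, grid_neighbors, manhattan)
```

Note the contrast with the budgeted tree development studied in this paper: the loop above runs until the goal is popped, whereas an OLT policy stops after a fixed number of expansions.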

6.3 Model predictive control 

Model Predictive Control techniques were originally introduced as ways to stabilize large-scale systems with
constraints around equilibrium points (or around a reference trajectory) |[36l[T0l[T5l. They exploit an explicitly
formulated model of the problem and solve, in a receding-horizon manner, a series of finite-time open-loop deterministic
optimal control problems. As such, they are very much related to look-ahead tree techniques. Actually,
an MPC technique that searches for the first action of the optimal sequence of actions over the finite optimisation
horizon through (clever) exploration of a tree is a look-ahead tree technique. However, most of the control
techniques labelled as MPC techniques do not use tree exploration for identifying (near-)optimal sequences of
actions. They rather make strong regularity assumptions on the system dynamics and the reward function (e.g.,
linearity) - that we do not make here - which are exploited to reformulate the search for an optimal open-loop
sequence of actions as a standard mathematical programming problem (e.g., convex optimisation problem, mixed
integer programming problem). This reformulation of the problem as a mathematical programming problem often
leads to techniques that are able to scale well to very large state-action spaces, provided of course that the
regularity assumptions hold.

6.4 Parameterized algorithms for decision-making 

Our work relies on a rather simple and generic research methodology: for a given targeted class of problems, (i) 
identify an algorithm skeleton that is believed to provide promising solutions to problem instances, (ii) parame- 
terize this algorithm and (iii) optimize the parameters in a problem-driven way, through DPS, i.e., through direct 
optimization of the algorithm performance. This methodology can be applied to a wide range of problem kinds 
and has already inspired several authors. The system proposed in places an optimisation layer on top of an 
approximate value iteration algorithm to optimize the location of its basis functions. In iTTBl (Section 5.3), the au- 
thors consider multi-stage stochastic programming techniques and optimize the scenario trees using Monte-Carlo 
methods. Closer to our work, it is proposed in ifTTTl to parameterize a tree-search technique for decision-making: 
upper confidence trees. In this work, the parameters make it possible to control the simulation policy used to estimate
long-term returns within the upper confidence tree algorithm.

Parameterized algorithms have also been shown to be relevant to solve various kinds of exploration / ex- 
ploitation dilemma in a problem-driven way. References ll33l and 0T1 propose to learn exploration / exploitation 
strategies for multi-armed bandit problems either by using the same kind of parameterizations (a simple linear 
function) and the same kind of optimizers (derivative-free global optimizers) as ours or by searching in a space of 
formulas. Reference [8] extends this idea to the exploration / exploitation dilemma that occurs in single-trajectory
reinforcement learning. Here, the parameters are no longer real-valued vectors, but rather small formulas that depend
on the internal variables of the policy. Given a target class of Markov decision problems, this method is able
to learn high-performance policies that outperform state-of-the-art generic reinforcement learning algorithms. 

We believe that with the recent progress in derivative-free global optimization (with algorithms such as cross-entropy,
CMA-ES and GPO) and the continuously growing available computing power, the class of such parameterized
algorithms optimized in a problem-driven way is likely to grow in the near future, since
they offer a systematic way to improve upon generic solutions, by exploiting problem-dependent characteristics
through learning.

7 Conclusion 

This paper has introduced optimized look-ahead tree policies, a novel technique that bridges the gap between 
two major families of solutions for solving sequential decision-making problems: direct policy search (DPS) and 
look-ahead tree (LT) policies. Our approach uses a new way of representing a policy by using parameterized 
look-ahead trees. In this representation, the parameters of the policy over which optimization takes place encode 
the node scoring function by which the tree is grown (until a predefined computational budget is exhausted) 
every time an action is required from the system. This approach manages to combine the best of both DPS and 
LT approaches. From the point of view of DPS, optimized look-ahead tree policies provide a highly-generic way 
of representing policies that alleviates the need to choose a complex function approximator. From the point of 
view of LT, DPS is the key that makes it possible to significantly reduce the online computational requirement, through a
principled offline parameter-learning procedure.
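
The representation summarized above can be illustrated with a short sketch of how a single decision might be taken. It is written under several assumptions that this section does not spell out: the node score is a linear function of hand-picked node features, the action returned is the first action on the path with the best discounted return found in the tree, and the toy chain model at the bottom is purely hypothetical. It should therefore be read as an illustration of the idea, not as the authors' algorithm.

```python
import heapq
from itertools import count


def olt_action(state, model, actions, features, theta, budget, gamma=1.0):
    """Pick an action by growing a look-ahead tree with `budget` node expansions.

    model(state, action) -> (next_state, reward) is a deterministic
    generative model; theta holds the parameters of the linear node score.
    """
    tie = count()  # tie-breaker so heapq never compares node payloads
    # Entries: (-score, tie, state, depth, discounted return, first action)
    open_list = [(0.0, next(tie), state, 0, 0.0, None)]
    best_return, best_action = float("-inf"), actions[0]
    for _ in range(budget):
        if not open_list:
            break
        _, _, s, depth, ret, first = heapq.heappop(open_list)
        for a in actions:  # expanding a node calls the model once per action
            s2, r = model(s, a)
            ret2 = ret + (gamma ** depth) * r
            first2 = a if first is None else first
            if ret2 > best_return:
                best_return, best_action = ret2, first2
            score = sum(t * f for t, f in zip(theta, features(s2, depth + 1, ret2)))
            heapq.heappush(open_list, (-score, next(tie), s2, depth + 1, ret2, first2))
    return best_action


# Hypothetical toy problem: walk on the integers, rewards peak at position 3.
def chain_model(x, a):
    nxt = x + a
    return nxt, max(0.0, 3.0 - abs(nxt - 3))


def toy_features(s, depth, ret):
    return (ret,)  # a single feature: discounted return accumulated so far


action = olt_action(0, chain_model, (-1, 1), toy_features, theta=(1.0,), budget=3)
```

In a DPS loop, `theta` would be the vector handed to the global optimizer (CE or GPO), while `budget` stays fixed, which is why the number of parameters does not grow with the size of the tree.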

We have shown through an extensive experimental study that our approach has several qualities: it (1) pro- 
duces high-performance policies that often outperform both pure LT policies and pure DPS policies, (2) requires 
at the offline training stage a substantially smaller number of policy evaluations than other DPS techniques, 
thanks to a small parameter space whose size does not depend on the policy complexity, (3) is easy to tune since 
it avoids having to choose a particular function approximator and (4) results in policies that are quite robust with 
respect to perturbations of the initial states. Compared to traditional LT policies, we have shown that learning a 
node scoring function can result in substantial savings in terms of required online computational resources. 

This paper has focused on a particular kind of sequential decision-making problem: finite actions, deterministic
transitions, known transition and reward functions (i.e., a generative black-box model which allows us to
simulate arbitrary transitions), and an "informative" reward. However, we believe that except for the last item,



Table 4: Physical parameters of the inverted pendulum domain

Symbol    Value           Meaning
$g$       9.81 [m/s^2]    gravitation
$m$       1 [kg]          mass of link
$l$       1 [m]           length of link
$\mu$     0.05            coefficient of friction



all the other items on this list can be addressed in future work. For example, a large number of actions means that
we can no longer exhaustively generate all successor states when predicting one step ahead; instead, one would
have to use sampling techniques or progressive widening techniques, which are also an active research topic in tree-based
search P31 [121 l9l 1431. Deterministic transitions can be relaxed to weakly stochastic transitions (weakly
meaning that there is only a small number of possible successor states) as was done in [7] to extend the optimistic
strategy from [23] to sparse stochastic systems. And finally, in application scenarios where a generative model is
not available from the beginning, one could also try to integrate model-learning over time into the policy search
framework (e.g., see [25]).

A Dynamic model of the inverted pendulum

Refer to the schematic representation of the inverted pendulum given in Figure 5. The state variables are the
angle measured from the vertical axis, $\varphi(t)$ [rad], and the angular velocity $\dot{\varphi}(t)$ [rad/s]. The control variable is
the torque $u(t)$ [Nm] applied, which is restricted to the interval $[-5, 5]$. The motion of the pendulum is described
by the differential equation:

$$\ddot{\varphi}(t) = \frac{1}{m l^2} \bigl( -\mu \dot{\varphi}(t) + m g l \sin \varphi(t) + u(t) \bigr). \qquad (27)$$

The angular velocity is restricted via saturation to the interval $\dot{\varphi} \in [-10, 10]$. The values and meaning of the
physical parameters are given in Table 4.

The solution to the continuous-time dynamic equation is obtained by writing Eq. (27) as a first-order system
and using a Runge-Kutta solver with $h = 40$ intermediate steps. The time step of the simulation is $\Delta t = 0.2$
sec, during which the applied control is kept constant. The 2-dimensional state vector is $x(t) = (\varphi(t), \dot{\varphi}(t))$, the
scalar control variable is $u(t)$. Since our algorithm requires a finite set of possible actions, we discretized the
continuous control space into 5 discrete action choices $a \in \{-5, -2.5, 0, 2.5, 5\}$.
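
The simulation procedure described in this appendix can be sketched as follows. Several points are assumptions rather than facts from the text: the pendulum equation is taken in its usual form (angular acceleration equal to friction, gravity and control torques divided by m l^2), the solver is assumed to be the classical fourth-order Runge-Kutta scheme, the saturation is applied after every intermediate step, and the parameter values are those read from Table 4.

```python
import math

G, M, L, MU = 9.81, 1.0, 1.0, 0.05       # gravitation, mass, length, friction
DT, SUBSTEPS = 0.2, 40                    # simulation step [s], intermediate steps
ACTIONS = (-5.0, -2.5, 0.0, 2.5, 5.0)     # discretized torques [Nm]


def derivative(state, u):
    """First-order form of the pendulum equation: d/dt (phi, phi_dot)."""
    phi, phi_dot = state
    phi_ddot = (-MU * phi_dot + M * G * L * math.sin(phi) + u) / (M * L ** 2)
    return (phi_dot, phi_ddot)


def rk4_step(state, u, h):
    """One classical Runge-Kutta step of size h with constant control u."""
    def shifted(k, scale):
        return (state[0] + scale * k[0], state[1] + scale * k[1])

    k1 = derivative(state, u)
    k2 = derivative(shifted(k1, h / 2), u)
    k3 = derivative(shifted(k2, h / 2), u)
    k4 = derivative(shifted(k3, h), u)
    return tuple(s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def step(state, u):
    """Advance the system by DT seconds under the constant torque u."""
    h = DT / SUBSTEPS
    for _ in range(SUBSTEPS):
        phi, phi_dot = rk4_step(state, u, h)
        state = (phi, max(-10.0, min(10.0, phi_dot)))  # velocity saturation
    return state
```

A function with the signature of `step` is what a look-ahead tree policy would call once per action when expanding a node, with `u` taken from `ACTIONS`.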

B Dynamic model of the double inverted pendulum 

Refer to the schematic representation of the double inverted pendulum on spring-linked carts given in Figure 5b. 
The state variables are, for each cart $i = 1, 2$, the angle of displacement measured from the vertical axis, $\theta_i(t)$ 
[rad], the angular velocity $\dot{\theta}_i(t)$ [rad/s], the position of the cart $x_i(t)$ [m] measured from the origin (note $x_1(t) < x_2(t)$), 
and its velocity $\dot{x}_i(t)$ [m/s]. The vector-valued control is the force $u_i(t)$ [N] applied to each cart, which 
is restricted to the interval $[-2, 2]$. The system as a whole is described by the system of differential equations 



Table 5: Physical parameters of the double inverted pendulum domain 

Symbol          Value            Meaning
$g$             9.81 [m/s$^2$]   gravitation
$L$             1.0 [m]          half-length of the track
$l$             0.5 [m]          half-length of a pole
$m_c$           1.0 [kg]         mass of a cart
$m_p$           0.1 [kg]         mass of a pole
$\mu_c$         0.0005           coefficient of friction of a cart
$\mu_p$         0.000002         coefficient of friction of a pole
$K$             2.0              coefficient $K$ of the spring
$l_s$           0.5 [m]          relaxed length of the spring
$l_s^{\min}$    0.1 [m]          minimum length of the spring before deformation
$l_s^{\max}$    1.5 [m]          maximum length of the spring before deformation

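(see online appendix of [23]): 

$$\ddot{\theta}_i(t) = \frac{b_i^2(t)\,a_i^{12}(t) - a^{22}\,b_i^1(t)}{a_i^{12}(t)\,a_i^{21}(t) - a^{11}a^{22}}, \quad i = 1, 2 \qquad (28)$$

$$\ddot{x}_i(t) = \frac{b_i^1(t) - a^{11}\,\ddot{\theta}_i(t)}{a_i^{12}(t)}, \quad i = 1, 2, \qquad (29)$$

where 

$$\begin{aligned}
a_i^{12}(t) &:= -\cos\theta_i(t) \\
a_i^{21}(t) &:= l\,m_p \cos\theta_i(t) \\
b_i^1(t) &:= g\sin\theta_i(t) - \mu_p\,\dot{\theta}_i(t) / (l\,m_p) \\
b_i^2(t) &:= l\,m_p\,\dot{\theta}_i(t)^2 \sin\theta_i(t) - f_i(t) + \mu_c\,\mathrm{sign}(\dot{x}_i(t)) \\
f_i(t) &:= u_i(t) + K\left(l_s - |x_2(t) - x_1(t)|\right) \\
a^{11} &:= 4l/3 \\
a^{22} &:= -(m_c + m_p).
\end{aligned}$$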
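
Eqs. (28)-(29) can be read as the closed-form solution of a 2x2 linear system in the two unknown accelerations of cart $i$, namely $a^{11}\ddot{\theta}_i + a_i^{12}\ddot{x}_i = b_i^1$ and $a_i^{21}\ddot{\theta}_i + a^{22}\ddot{x}_i = b_i^2$. The sketch below evaluates the closed form with the parameter values of Table 5 and checks it against this linear system. It is an illustration only: the function names are hypothetical, and the total force $f_i$ is passed in as an argument rather than computed from the control and the spring.

```python
import math

# Parameter values from Table 5.
G, L_POLE, M_C, M_P = 9.81, 0.5, 1.0, 0.1
MU_C, MU_P = 0.0005, 0.000002


def coefficients(theta, dtheta, dx, f):
    """Coefficients a^11, a^12, a^21, a^22, b^1, b^2 for one cart."""
    a11 = 4.0 * L_POLE / 3.0
    a12 = -math.cos(theta)
    a21 = L_POLE * M_P * math.cos(theta)
    a22 = -(M_C + M_P)
    sign_dx = (dx > 0) - (dx < 0)
    b1 = G * math.sin(theta) - MU_P * dtheta / (L_POLE * M_P)
    b2 = L_POLE * M_P * dtheta ** 2 * math.sin(theta) - f + MU_C * sign_dx
    return a11, a12, a21, a22, b1, b2


def accelerations(theta, dtheta, dx, f):
    """Angular and linear acceleration according to Eqs. (28)-(29).

    The division by a12 = -cos(theta) is undefined for a horizontal pole.
    """
    a11, a12, a21, a22, b1, b2 = coefficients(theta, dtheta, dx, f)
    ddtheta = (b2 * a12 - a22 * b1) / (a12 * a21 - a11 * a22)
    ddx = (b1 - a11 * ddtheta) / a12
    return ddtheta, ddx


def residuals(theta, dtheta, dx, f):
    """Residuals of the underlying linear system; both should vanish."""
    a11, a12, a21, a22, b1, b2 = coefficients(theta, dtheta, dx, f)
    ddtheta, ddx = accelerations(theta, dtheta, dx, f)
    return (a11 * ddtheta + a12 * ddx - b1,
            a21 * ddtheta + a22 * ddx - b2)
```

Writing the model this way makes it easy to confirm that a reconstruction of (28)-(29) is algebraically consistent before it is passed to the integrator.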

The angular velocity is restricted via saturation to the interval $\dot{\theta}_i \in [-10, 10]$, the velocity of the cart to 
$\dot{x}_i \in [-5, 5]$. The system has two terminal conditions that stop the process: collision 
between one of the carts and a wall, and collision between the carts. More specifically, these conditions are 
implemented as follows: the temporal evolution of the system is halted if at any time $t$ at least one of the 
following is true: 

• $|x_i(t)| > L$ (first cart collides with left wall or second cart collides with right wall) 

• $x_2(t) < x_1(t)$ (first cart has passed the second cart) 

• $|x_2(t) - x_1(t)| \notin [l_s^{\min}, l_s^{\max}]$ (outside minimal and maximal length of spring before deformation) 
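
The three stopping tests listed above translate directly into a predicate on the two cart positions. The following is a minimal sketch using the values of $L$, $l_s^{\min}$ and $l_s^{\max}$ from Table 5; the function name `is_terminal` is illustrative and not taken from the paper.

```python
# Values from Table 5.
L_TRACK = 1.0              # half-length of the track [m]
LS_MIN, LS_MAX = 0.1, 1.5  # admissible spring lengths [m]


def is_terminal(x1, x2):
    """Return True if the episode must stop for cart positions x1, x2."""
    hits_wall = abs(x1) > L_TRACK or abs(x2) > L_TRACK   # a cart leaves the track
    carts_crossed = x2 < x1                              # first cart passed the second
    gap = abs(x2 - x1)
    spring_deformed = not (LS_MIN <= gap <= LS_MAX)      # spring length out of range
    return hits_wall or carts_crossed or spring_deformed
```

In a simulator this check would typically be evaluated after each integration step, so that no successor state is generated once a terminal condition holds.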

The values and meaning of the physical parameters are given in Table 5. 

The solution to the continuous-time dynamic equations is obtained by writing Eqs. (28)-(29) as a first-order 
system and using a Runge-Kutta solver with $h = 10$ intermediate steps. The time step of the simulation 
is $\Delta t = 0.1$ sec, during which the applied control is kept constant. The 8-dimensional state vector is 
$x(t) = (x_1(t), x_2(t), \dot{x}_1(t), \dot{x}_2(t), \theta_1(t), \theta_2(t), \dot{\theta}_1(t), \dot{\theta}_2(t))$, the control vector is $u(t) = (u_1(t), u_2(t))$. Since 
our algorithm requires a finite set of possible actions, we discretized the control space into 4 discrete action 
choices $a \in \{(-2, -2), (-2, +2), (+2, -2), (+2, +2)\}$. 



Table 6: Physical parameters of the acrobot domain 

Symbol      Value                   Meaning
$g$         9.8 [m/s$^2$]           gravitation
$m_i$       1 [kg]                  mass of link $i$
$l_i$       1 [m]                   length of link $i$
$l_{ci}$    0.5 [m]                 length to center of mass of link $i$
$I_i$       1 [kg $\cdot$ m$^2$]    moment of inertia of link $i$



C Dynamic model of the acrobot 

Refer to the schematic representation of the acrobot domain given in Figure 5c. The state variables are the 
angle of the first link measured from the horizontal axis, $\theta_1(t)$ [rad], the angular velocity $\dot{\theta}_1(t)$ [rad/s], the angle 
between the second link and the first link $\theta_2(t)$ [rad], and its angular velocity $\dot{\theta}_2(t)$ [rad/s]. The control variable 
is the torque $\tau(t)$ [Nm] applied at the second joint. The dynamic model of the acrobot system is [46]: 

$$\ddot{\theta}_1(t) = -\frac{1}{d_1(t)}\left(d_2(t)\,\ddot{\theta}_2(t) + \phi_1(t)\right) \qquad (30)$$

$$\ddot{\theta}_2(t) = \frac{1}{m_2 l_{c2}^2 + I_2 - \frac{d_2(t)^2}{d_1(t)}}\left(\tau(t) + \frac{d_2(t)}{d_1(t)}\,\phi_1(t) - m_2 l_1 l_{c2}\,\dot{\theta}_1(t)^2 \sin\theta_2(t) - \phi_2(t)\right) \qquad (31)$$

where 

$$\begin{aligned}
d_1(t) &:= m_1 l_{c1}^2 + m_2\left(l_1^2 + l_{c2}^2 + 2 l_1 l_{c2}\cos\theta_2(t)\right) + I_1 + I_2 \\
d_2(t) &:= m_2\left(l_{c2}^2 + l_1 l_{c2}\cos\theta_2(t)\right) + I_2 \\
\phi_1(t) &:= -m_2 l_1 l_{c2}\,\dot{\theta}_2(t)^2 \sin\theta_2(t) - 2 m_2 l_1 l_{c2}\,\dot{\theta}_2(t)\,\dot{\theta}_1(t)\sin\theta_2(t) + (m_1 l_{c1} + m_2 l_1)\,g\cos\theta_1(t) + \phi_2(t) \\
\phi_2(t) &:= m_2 l_{c2}\,g\cos(\theta_1(t) + \theta_2(t)).
\end{aligned}$$

The angular velocities are restricted via saturation to the intervals $\dot{\theta}_1 \in [-4\pi, 4\pi]$ and $\dot{\theta}_2 \in [-9\pi, 9\pi]$. The values 
and meaning of the physical parameters are given in Table 6; we used the same parameters as in [38]. 

The solution to the continuous-time dynamic equations in Eqs. (30)-(31) is obtained using a Runge-Kutta 
solver with $h = 20$ intermediate steps. The time step of the simulation is $\Delta t = 0.2$ sec, during which the applied 
control is kept constant. The 4-dimensional state vector is $x(t) = (\theta_1(t), \theta_2(t), \dot{\theta}_1(t), \dot{\theta}_2(t))$, the scalar control 
variable is $\tau(t)$. 
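
As an illustration of how Eqs. (30)-(31) turn into a simulator, the sketch below evaluates the two accelerations with the parameters of Table 6 and advances the state by one time step with the torque held constant. It is a sketch under assumptions rather than the code used in the experiments: the 20 intermediate steps are read as classical fourth-order Runge-Kutta substeps, the velocity saturation is applied after every substep, and all function names are hypothetical.

```python
import math

# Parameters from Table 6 (identical for both links).
G = 9.8
M1 = M2 = 1.0
L1 = 1.0
LC1 = LC2 = 0.5
I1 = I2 = 1.0
DT, SUBSTEPS = 0.2, 20
VEL_MAX = (4 * math.pi, 9 * math.pi)   # saturation bounds for dtheta1, dtheta2


def accelerations(th1, th2, dth1, dth2, tau):
    """Right-hand sides of Eqs. (30)-(31)."""
    d1 = M1 * LC1 ** 2 + M2 * (L1 ** 2 + LC2 ** 2 + 2 * L1 * LC2 * math.cos(th2)) + I1 + I2
    d2 = M2 * (LC2 ** 2 + L1 * LC2 * math.cos(th2)) + I2
    phi2 = M2 * LC2 * G * math.cos(th1 + th2)
    phi1 = (-M2 * L1 * LC2 * dth2 ** 2 * math.sin(th2)
            - 2 * M2 * L1 * LC2 * dth2 * dth1 * math.sin(th2)
            + (M1 * LC1 + M2 * L1) * G * math.cos(th1) + phi2)
    ddth2 = ((tau + d2 / d1 * phi1 - M2 * L1 * LC2 * dth1 ** 2 * math.sin(th2) - phi2)
             / (M2 * LC2 ** 2 + I2 - d2 ** 2 / d1))
    ddth1 = -(d2 * ddth2 + phi1) / d1
    return ddth1, ddth2


def deriv(x, tau):
    """First-order form for the state x = (th1, th2, dth1, dth2)."""
    th1, th2, dth1, dth2 = x
    ddth1, ddth2 = accelerations(th1, th2, dth1, dth2, tau)
    return (dth1, dth2, ddth1, ddth2)


def step(x, tau):
    """Advance the acrobot by DT seconds with the torque tau held constant."""
    h = DT / SUBSTEPS
    for _ in range(SUBSTEPS):
        k1 = deriv(x, tau)
        k2 = deriv(tuple(a + h / 2 * b for a, b in zip(x, k1)), tau)
        k3 = deriv(tuple(a + h / 2 * b for a, b in zip(x, k2)), tau)
        k4 = deriv(tuple(a + h * b for a, b in zip(x, k3)), tau)
        th1, th2, dth1, dth2 = (a + h / 6 * (p + 2 * q + 2 * r + s)
                                for a, p, q, r, s in zip(x, k1, k2, k3, k4))
        dth1 = max(-VEL_MAX[0], min(VEL_MAX[0], dth1))   # velocity saturation
        dth2 = max(-VEL_MAX[1], min(VEL_MAX[1], dth2))
        x = (th1, th2, dth1, dth2)
    return x
```

With these parameter values both accelerations vanish (up to rounding) at $(\pi/2, 0, 0, 0)$ with zero torque, which is a quick sanity check that this state is an equilibrium of the model.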

The motor was allowed to produce torques $\tau$ in the range $[-1, 1]$. Since our algorithm requires a finite set of 
possible actions, we discretized the continuous control space. Here we use three actions: the first two correspond 
to a bang-bang control and take on the extreme values $-1$ and $+1$. However, a bang-bang control alone does 
not allow us to keep the acrobot in the inverted handstand position, which is an unstable equilibrium. As a third 
action, we therefore introduce a more complex balance-action, which is derived via LQR. First, we linearize the 
acrobot's equation of motion about the unstable equilibrium $(\pi/2, 0, 0, 0)$, yielding: 

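$$\dot{x}(t) = A x(t) + B u(t),$$

where, after plugging in the physical parameters of Table 6, 

$$A = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 6.21 & -0.95 & 0 & 0 \\ -4.78 & 5.25 & 0 & 0 \end{bmatrix}, \quad
B = \begin{bmatrix} 0 \\ 0 \\ -0.68 \\ 1.75 \end{bmatrix}, \quad
x(t) = \begin{bmatrix} \theta_1(t) - \pi/2 \\ \theta_2(t) \\ \dot{\theta}_1(t) \\ \dot{\theta}_2(t) \end{bmatrix}, \quad
u(t) = \tau(t).$$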



Using MATLAB, an LQR controller was then computed for the cost matrices $Q = I_{4 \times 4}$ and $R = 1$, yielding the 
state feedback law 

$$u(t) = -Kx(t), \qquad (32)$$

with constant gain matrix $K = [-189.28, -47.46, -89.38, -29.19]$. The values resulting from Eq. (32) were 
truncated to stay inside the valid range $[-1, 1]$. Note that the LQR controller works as intended and produces 
meaningful results only when the state is already in a close neighborhood of the handstand state; in particular, it 
is incapable of swinging up and balancing the acrobot on its own from the initial state (0, 0, 0, 0). 
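
The balance-action can be sketched as a saturated linear state feedback. The snippet below uses the gain $K$ quoted above and assumes that the feedback acts on the deviation of the state from the handstand, with the first component taken as $\theta_1 - \pi/2$; the function name `balance_action` is hypothetical.

```python
import math

# LQR gain quoted in the text.
K = (-189.28, -47.46, -89.38, -29.19)
TAU_MAX = 1.0   # admissible torque range is [-1, 1]


def balance_action(th1, th2, dth1, dth2):
    """Saturated feedback u = -K x, with x the deviation from the handstand.

    Assumption: the deviation is x = (th1 - pi/2, th2, dth1, dth2).
    """
    x = (th1 - math.pi / 2, th2, dth1, dth2)
    u = -sum(k * xi for k, xi in zip(K, x))
    return max(-TAU_MAX, min(TAU_MAX, u))   # truncate to the valid range
```

Because the gains are large compared with the torque limit, the unsaturated law already exceeds the limit for an angular deviation of about 0.005 rad in $\theta_1$ alone ($1/189.28 \approx 0.0053$), which is consistent with the remark that the controller is only meaningful very close to the handstand.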



D Dynamic model of the HIV drug treatment domain 

The HIV infection dynamics are described by a six-dimensional nonlinear system with the state vector $x(t) = 
(T_1(t), T_2(t), T_1^*(t), T_2^*(t), V(t), E(t))$, where 

1. $T_1(t) \geq 0$ ($T_1^*(t) \geq 0$) is the number of non-infected (infected) CD4$^+$ T-lymphocytes (in cells/ml), 

2. $T_2(t) \geq 0$ ($T_2^*(t) \geq 0$) is the number of non-infected (infected) macrophages (in cells/ml), 

3. $V(t) \geq 0$ is the number of free HI virus particles (in copies/ml), and 

4. $E(t) \geq 0$ is the number of cytotoxic T-lymphocytes (in cells/ml). 

The dynamics are described by the following system of first-order differential equations (see [1]): 

$$\dot{T}_1(t) = \lambda_1 - d_1 T_1(t) - (1 - \epsilon_1(t))\,k_1 V(t) T_1(t) \qquad (33)$$

$$\dot{T}_2(t) = \lambda_2 - d_2 T_2(t) - (1 - f\epsilon_1(t))\,k_2 V(t) T_2(t) \qquad (34)$$

$$\dot{T}_1^*(t) = (1 - \epsilon_1(t))\,k_1 V(t) T_1(t) - \delta T_1^*(t) - m_1 E(t) T_1^*(t) \qquad (35)$$

$$\dot{T}_2^*(t) = (1 - f\epsilon_1(t))\,k_2 V(t) T_2(t) - \delta T_2^*(t) - m_2 E(t) T_2^*(t) \qquad (36)$$

$$\dot{V}(t) = (1 - \epsilon_2(t))\,N_T\,\delta\,(T_1^*(t) + T_2^*(t)) - cV(t) - \left[(1 - \epsilon_1(t))\,\rho_1 k_1 T_1(t) + (1 - f\epsilon_1(t))\,\rho_2 k_2 T_2(t)\right] V(t) \qquad (37)$$

$$\dot{E}(t) = \lambda_E + \frac{b_E\,(T_1^*(t) + T_2^*(t))}{T_1^*(t) + T_2^*(t) + K_b}\,E(t) - \frac{d_E\,(T_1^*(t) + T_2^*(t))}{T_1^*(t) + T_2^*(t) + K_d}\,E(t) - \delta_E E(t) \qquad (38)$$

The vector-valued control variable is $u(t) = (\epsilon_1(t), \epsilon_2(t))$, where $\epsilon_1$ and $\epsilon_2$ correspond to the dosage of 
the reverse transcriptase inhibitor drug (RTI) and the protease inhibitor drug (PI), respectively. In STI, drugs are 
either fully administered (they are "on") or not at all (they are "off"). A fully administered RTI drug corresponds 
to the value $\epsilon_1 = 0.7$, while a fully administered PI drug corresponds to the value $\epsilon_2 = 0.3$. This leads to 
a discrete action space with four possible choices $a \in \{(0.7, 0.3), (0.7, 0), (0, 0.3), (0, 0)\}$. Because it is not 
clinically feasible to change the treatment daily, the state is measured and the drugs are switched on or off once 
every five days. Therefore, the system is controlled in discrete time with a sampling period of $\Delta t = 5$ days 
(during which the chosen controls are kept constant). 

As shown in [1], in the absence of treatment (i.e., $\epsilon_1 = \epsilon_2 = 0$), the system in Eqs. (33)-(38) exhibits three 
physical equilibrium points: 

1. an unstable equilibrium point $(T_1, T_2, T_1^*, T_2^*, V, E) = (10^6, 3198, 0, 0, 0, 10)$ which represents an 
uninfected state; 

2. a "healthy" locally stable equilibrium point $(T_1, T_2, T_1^*, T_2^*, V, E) = (967839, 621, 76, 6, 415, 353108)$ 
which corresponds to a small viral load, a high CD4$^+$ T-lymphocyte count and a high HIV-specific 
cytotoxic T-cell count; 

3. a "non-healthy" locally stable equilibrium point $(T_1, T_2, T_1^*, T_2^*, V, E) = (163573, 5, 11945, 46, 63919, 24)$ 
for which T-cells are depleted and the viral load is very high. 

Numerical simulations show that the basin of attraction of the healthy steady-state is relatively small in 
comparison with that of the non-healthy steady-state. Furthermore, perturbation of the uninfected steady-state by 
adding as little as one single particle of virus per ml of blood plasma leads to asymptotic convergence towards 
the non-healthy steady-state. 



The solution to the continuous-time dynamic equations in Eqs. (33)-(38) is obtained by using a Runge-Kutta 
solver with $h = 500$ intermediate steps. The values and meaning of the constants in the model are the same as in 
[1], 03): $\lambda_1 = 10{,}000$, $d_1 = 0.01$, $k_1 = 8 \cdot 10^{-7}$, $\lambda_2 = 31.98$, $d_2 = 0.01$, $f = 0.34$, $k_2 = 1 \cdot 10^{-4}$, $\delta = 0.7$, 
$m_1 = 1 \cdot 10^{-5}$, $m_2 = 1 \cdot 10^{-5}$, $N_T = 100$, $c = 13$, $\rho_1 = 1$, $\rho_2 = 1$, $\lambda_E = 1$, $b_E = 0.3$, $K_b = 100$, $d_E = 0.25$, 
$K_d = 500$, $\delta_E = 0.1$. 
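
To make the model concrete, the following sketch evaluates the right-hand sides of Eqs. (33)-(38) with the constants listed above and advances the state by one sampling period with the drug levels held constant. It is an illustration under assumptions, not the authors' code: the 500 intermediate steps are read as classical fourth-order Runge-Kutta substeps over the 5-day period, time is measured in days, and the function names are hypothetical.

```python
# Model constants as listed in the text.
LAM1, D1, K1 = 10000.0, 0.01, 8e-7
LAM2, D2, F, K2 = 31.98, 0.01, 0.34, 1e-4
DELTA, M1, M2 = 0.7, 1e-5, 1e-5
N_T, C, RHO1, RHO2 = 100.0, 13.0, 1.0, 1.0
LAM_E, B_E, K_B, D_E, K_D, DELTA_E = 1.0, 0.3, 100.0, 0.25, 500.0, 0.1

DT_DAYS, SUBSTEPS = 5.0, 500
ACTIONS = ((0.7, 0.3), (0.7, 0.0), (0.0, 0.3), (0.0, 0.0))   # (eps1, eps2)


def deriv(x, action):
    """Right-hand sides of Eqs. (33)-(38) for x = (T1, T2, T1*, T2*, V, E)."""
    t1, t2, t1s, t2s, v, e = x
    eps1, eps2 = action
    inf1 = (1 - eps1) * K1 * v * t1          # new infections of T-lymphocytes
    inf2 = (1 - F * eps1) * K2 * v * t2      # new infections of macrophages
    ts = t1s + t2s
    return (
        LAM1 - D1 * t1 - inf1,
        LAM2 - D2 * t2 - inf2,
        inf1 - DELTA * t1s - M1 * e * t1s,
        inf2 - DELTA * t2s - M2 * e * t2s,
        (1 - eps2) * N_T * DELTA * ts - C * v - (RHO1 * inf1 + RHO2 * inf2),
        LAM_E + B_E * ts / (ts + K_B) * e - D_E * ts / (ts + K_D) * e - DELTA_E * e,
    )


def step(x, action):
    """Advance the state by DT_DAYS with the drug levels held constant."""
    h = DT_DAYS / SUBSTEPS
    for _ in range(SUBSTEPS):
        k1 = deriv(x, action)
        k2 = deriv(tuple(a + h / 2 * b for a, b in zip(x, k1)), action)
        k3 = deriv(tuple(a + h / 2 * b for a, b in zip(x, k2)), action)
        k4 = deriv(tuple(a + h * b for a, b in zip(x, k3)), action)
        x = tuple(a + h / 6 * (p + 2 * q + 2 * r + s)
                  for a, p, q, r, s in zip(x, k1, k2, k3, k4))
    return x


UNINFECTED = (1e6, 3198.0, 0.0, 0.0, 0.0, 10.0)
```

Evaluating `deriv(UNINFECTED, (0.0, 0.0))` gives values that vanish up to rounding, in line with the uninfected equilibrium point quoted above.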